Abstract. This review traces the evolution of theory that started when
Charles Stein in 1955 [In Proc. 3rd Berkeley Sympos. Math. Statist.
Probab. I (1956) 197-206, Univ. California Press] showed that using
each separate sample mean from $k \ge 3$ Normal populations to estimate
its own population mean $\mu_i$ can be improved upon uniformly for every
possible $\mu = (\mu_1, \ldots, \mu_k)'$. The dominating estimators,
referred to here as being "Model-I minimax," can be found by shrinking
the sample means toward any constant vector. Admissible minimax
shrinkage estimators were derived by Stein and others as posterior
means based on a random effects model, "Model-II" here, wherein the
$\mu_i$ values have their own distributions. Section 2 centers on
Figure 2, which organizes a wide class of priors on the unknown
Level-II hyperparameters that have been proved to yield admissible
Model-I minimax shrinkage estimators in the "equal variance case."
Putting a flat prior on the Level-II variance is unique in this class
for its scale-invariance and for its conjugacy, and it induces Stein's
harmonic prior (SHP) on $\mu_i$.

Component estimators with real data, however, often have substantially
"unequal variances." While Model-I minimaxity is achievable in such
cases, this standard requires estimators to have "reverse shrinkages,"
as when the large variance component sample means shrink less (not
more) than the more accurate ones. Section 3 explains how Model-II
provides appropriate shrinkage patterns, and investigates especially
estimators determined exactly or approximately from the posterior
distributions based on the objective priors that produce Model-I
minimaxity in the equal variances case. While correcting the reversed
shrinkage defect, Model-II minimaxity can hold for every component.
In a real example of hospital profiling data, the SHP prior is shown to
provide estimators that are Model-II minimax, and posterior intervals
that have adequate Model-II coverage, that is, both conditionally on
every possible Level-II hyperparameter and for every individual
component $\mu_i$, $i = 1, \ldots, k$.

Key words and phrases: Hierarchical model, empirical Bayes, unequal
variances, Model-II evaluations, Stein's harmonic prior.
1. INTRODUCTION: STEIN AND SHRINKAGE ESTIMATION

Charles Stein [23] stunned the statistical world by showing that
estimating $k$ population means $\mu = (\mu_1, \ldots, \mu_k)'$ with
their sample means $y = (y_1, \ldots, y_k)'$ is inadmissible. That
result assumes $k \ge 3$ independent Normal distributions and a sum of
mean squared component errors risk function. With Willard James [14],
he provided a specific shrinkage estimator, the James-Stein estimator,
which dominates the sample mean vector very substantially.

This first section introduces the history of the James-Stein minimax
estimator and its extensions when "equal variances" prevail, with
"Model-I" evaluations that are conditional on $\mu$. However,
"Model-I" does not allow certain practical needs to be met, such as
valid confidence intervals. Section 2 shows how this has been rectified
by enlarging "Model-I" to "Model-II" wherein random effects
distributions are assigned in Level-II. The resulting framework enables
repeated sampling (frequency based) interval estimates [9, 17] and
frees practitioners from determining and specifying valid relative
weights for each squared-error component loss, upon which Model-I
minimax estimators depend critically. Model-II even supports developing
admissible minimax shrinkage estimators via posterior mean calculations
by simplifying the specification of prior distributions, proper and
otherwise, on the Level-II parameters. The centerpiece of Section 2 is
Figure 2, which graphically organizes some priors on the Level-II
variance that lead to minimax estimators. Stein's harmonic prior (SHP)
in Figure 2 corresponds to an admissible shrinkage estimator that
provides acceptable frequency coverage intervals in Model-II
evaluations.

Section 3 reviews the unequal variances case that arises regularly in
practice, but for which mathematical evaluations are difficult. The
previous sections are meant especially to provide the background needed
for more research on the operating characteristics in repeated sampling
of unequal variances procedures, while Section 3 shows why that is
needed. It is shown in Section 3 why, in substantially unequal
variances settings, the Model-II random effects framework works well
while the Model-I perspective provides inappropriate ("reversed")
shrinkage patterns. The SHP prior in that setting leads to estimators
and to formal posterior intervals for $\mu_i$, $i = 1, \ldots, k$, that
appear to provide approximate (or conservative) frequency confidence
intervals with respect to Model-II evaluation standards: for each
individual $\mu_i$ the coverage approximates or exceeds its nominal 95%
level, no matter what the true Level-II variance.

Table 1
Hospital profiling data and James-Stein shrinkage estimates
for $k = 10$ NY hospitals

 $i$    $y_i$    $sd_i$   $V_i$   $\hat{B}_{JS}$   $\hat{\mu}_{JS,i}$
  1     -2.15    1.0      1.0     0.688            -0.67
  2     -0.34    1.0      1.0     0.688            -0.11
  3     -0.08    1.0      1.0     0.688            -0.02
  4      0.01    1.0      1.0     0.688             0.00
  5      0.08    1.0      1.0     0.688             0.02
  6      0.57    1.0      1.0     0.688             0.18
  7      0.61    1.0      1.0     0.688             0.19
  8      0.86    1.0      1.0     0.688             0.27
  9      1.11    1.0      1.0     0.688             0.35
 10      2.05    1.0      1.0     0.688             0.64

The data in Table 1 provide an equal variance example based on a 1992
medical profiling evaluation of $k = 10$ New York hospitals. We are to
consider these as Normally-distributed indices of successful outcome
rates for patients at these 10 hospitals following coronary artery
bypass graft (CABG) surgeries. The indices are centered so that the New
York statewide average outcome over all hospitals lies near 0. Larger
estimates $y_i$ indicate hospitals that performed better for these
surgeries. For example, Hospital 10 was more than 2 standard deviations
above the statewide mean. All 10 sample means have nearly the same
variances, which we have scaled so the common variance is about
$V = 1.00$. The variances $V_i$ must be the same in order to meet the
equal variance assumption upon which the James-Stein estimator is
based. This "equal variance" case enables various mathematical
calculations that are difficult, if not impossible, for the widely
encountered "unequal variances" situation.

The vector of sample means $y$ has total mean squared error (risk) as
an estimator of $\mu$ given by

\[ E \sum_{i=1}^{k} (y_i - \mu_i)^2 = \sum_{i=1}^{k} V_i = kV. \]

This unbiased estimator is minimax, since its constant risk is the
limit of the risks of a sequence of proper Bayes rules (see, e.g.,
Theorem 18 of Chapter 5 in [3]).

In the simplest situation, the James-Stein estimator "shrinks" $y_i$
toward an arbitrarily preassigned constant $\mu_0$. It is appropriate
to set $\mu_0 = 0$ in this case because we have recentered the CABG
indices to have NY statewide mean equal to 0. Then with $\mu_0 = 0$,
the sum of squared residuals for these data,

\[ S = \sum_{i=1}^{10} y_i^2 / V = 11.62, \]

would have a $\chi^2_{(10)}$ distribution if the hypothesis that all
values of $\mu_i = \mu_0 = 0$ were true, so the test fails to reject
the null at even the 30% level. However, most members of the medical
community would not believe that all hospitals are equally effective,
and many in the statistical community would be reluctant to think that
the first and last hospitals in the list, whose quality estimates
differ by more than 2 standard deviations from 0, should be declared to
have the same underlying quality as all the others.
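As a quick numerical check of the test just described, the upper tail of a chi-square distribution with even degrees of freedom can be written as a finite Poisson sum. The sketch below is ours, not part of the original analysis; it assumes $V = 1$, uses the ten $y_i$ values of Table 1, and the helper name `chi2_sf_even` is hypothetical.

```python
import math

# Sample means y_i for the 10 hospitals of Table 1 (common variance V = 1).
y = [-2.15, -0.34, -0.08, 0.01, 0.08, 0.57, 0.61, 0.86, 1.11, 2.05]
V = 1.0

def chi2_sf_even(x, df):
    """Upper tail P[chi2_df > x] for even df, via the Poisson-sum identity."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)

S = sum(v * v for v in y) / V        # sum of squares scaled by V
p_value = chi2_sf_even(S, len(y))    # test of mu_i = 0 for every i
print(round(S, 2), round(p_value, 3))
```

The p-value comes out a little above 0.30, which is why the hypothesis of identical hospitals is not rejected at the 30% level.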

On the other hand, $S$ isn't far from its expectation $k = 10$ if all
the $\mu_i$ are 0, and some extreme rates would occur, at least in
part, because of randomness. Thus, regression-toward-the-mean (RTTM),
that is, shrinkage toward $\mu_0$, would be expected if more data were
to appear for these hospitals.

RTTM is anticipated if one believes that there is some similarity among
the hospitals, and that sampling variation is part of the reason for
the extreme hospitals. That is, the hospital with the highest quality
index with $y_{10} = 2.05$ probably has a true mean $\mu_{10}$ smaller
than 2.05 because

\[ E\Bigl[\max_{1 \le i \le k} y_i \Bigm| \mu\Bigr]
   \ge \max_{1 \le i \le k} E[y_i \mid \mu]
   = \max_{1 \le i \le k} \mu_i \ge \mu_{10} \]

(by Jensen's inequality and convexity of the maximum function). So we
expect in this case that the observed maximum $y_{10} = 2.05$ exceeds
$\mu_{10}$, and a shrunken estimator is in order. The two-level
Model-II, soon to be described, anticipates and models RTTM, leading to
shrinkage estimation.
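The inequality $E[\max_i y_i \mid \mu] \ge \max_i \mu_i$ can be illustrated by simulation. The following sketch is only an illustration under assumptions of our own: ten hypothetical hospitals with identical true means $\mu_i = 0$ and $V = 1$, with the function name `mean_of_max` and the seed chosen arbitrarily.

```python
import random

def mean_of_max(mu, V=1.0, reps=20000, seed=1):
    """Monte Carlo estimate of E[max_i y_i | mu] with y_i ~ N(mu_i, V)."""
    rng = random.Random(seed)
    sd = V ** 0.5
    total = 0.0
    for _ in range(reps):
        total += max(rng.gauss(m, sd) for m in mu)
    return total / reps

mu_equal = [0.0] * 10              # hypothetical: ten identical hospitals
est = mean_of_max(mu_equal)
print(est > max(mu_equal))         # the largest observation overshoots
```

Even when every hospital is identical, the largest of the ten sample means sits well above the common true mean, which is the regression-toward-the-mean effect in its purest form.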



Following earlier notation set in a series of papers by Efron and
Morris, for example, [9], about Stein's estimator and its
generalizations, we denote shrinkage factors by the letter $B$ (often
with subscripts). The James-Stein shrinkage coefficient for this
setting is calculated as

\[ \hat{B}_{JS} = (k - 2)/S, \]

which for these data is $\hat{B}_{JS} = 8/11.62 = 0.688$. This
estimator then shrinks the usual unbiased estimates $y_i$ toward
$\mu_0 = 0$ according to

\[ \hat{\mu}_{JS,i} = (1 - \hat{B}_{JS}) y_i + \hat{B}_{JS} \mu_0
   = (1 - \hat{B}_{JS}) y_i. \]

Based on this shrinkage estimate, future observations are being
predicted to regress about 68.8% of the way toward 0. Column 5 of
Table 1 lists the shrunken values

\[ (1 - 0.688) \times y_i + 0.688 \times 0 = 0.312 \times y_i \]

for each hospital, the James-Stein estimate of the mean. For example,
the estimate of Hospital 10's quality index is reduced from 2.05
standard deviations above the New York mean to 0.64 standard
deviations. The RTTM effect is strong for these 10 hospitals, which are
estimated to be more similar than different, with only 31.2% of the
weight allocated to each hospital's own estimate. Figure 1 illustrates
the shrinkage pattern.
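The calculation behind the last column of Table 1 can be reproduced in a few lines. This is a minimal sketch under the stated assumptions ($V = 1$, target $\mu_0 = 0$); the function name `james_stein` is ours.

```python
# Sample means y_i for the 10 hospitals of Table 1 (common variance V = 1).
y = [-2.15, -0.34, -0.08, 0.01, 0.08, 0.57, 0.61, 0.86, 1.11, 2.05]
V = 1.0
mu0 = 0.0                                    # shrinkage target

def james_stein(y, V, mu0=0.0):
    """Return (S, B_js, estimates) for shrinkage toward the constant mu0."""
    k = len(y)
    if k < 3:
        raise ValueError("the James-Stein estimator needs k >= 3")
    S = sum((v - mu0) ** 2 for v in y) / V   # scaled sum of squares
    B = (k - 2) / S                          # shrinkage coefficient
    est = [(1 - B) * v + B * mu0 for v in y]
    return S, B, est

S, B_js, mu_js = james_stein(y, V, mu0)
print(round(S, 2), round(B_js, 3), [round(m, 2) for m in mu_js])
```

The printed values agree with Table 1 after rounding to two decimals.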

The parameter $\mu_i$ can be thought of as the quality index that would
result for hospital $i$ if that hospital theoretically could have
performed a huge number of CABG surgeries in 1992. Whether the JS
estimator of quality is a better estimator of $\mu$ than $y$ for these
data cannot be guaranteed because the true values of $\mu$ aren't
known. However, one can calculate an unbiased estimator of the expected
risk (i.e., for sum of squared errors) of the JS estimator [14].
Fig. 1. Unbiased (top) versus James-Stein (bottom) estimates for 10 NY hospitals.



C. N. MORRIS AND M. LYSY 



This unbiased estimator of the risk is

\[ \hat{R} = V\bigl(k - (k - 2)\hat{B}_{JS}\bigr), \]

a function of $y$ only through $S$. That $y$ is inadmissible and that
the JS estimate is "minimax" (risk never exceeds $kV$) follows because
$\hat{B}_{JS} > 0$ for all data sets. This proves minimaxity, that the
risk of $\hat{\mu}_{JS} = (\hat{\mu}_{JS,1}, \ldots, \hat{\mu}_{JS,k})'$
as a function of $\mu$ is

\[ E[\hat{R} \mid \mu] = V\bigl(k - (k - 2) E\hat{B}_{JS}\bigr) < kV. \]

For these data, $\hat{R} = 1.00 \times (10 - 8 \times 0.688) = 4.496$.
This is a large reduction in mean squared error, less than half of
$kV = 10$, the risk of the separate unshrunken estimates $y_i$. In
fact, the smallest possible value of the risk for the JS estimator is
$2V$, when $\mu = 0$, for any value of $k \ge 3$, thus offering very
substantial possible improvements on $y$.
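Both risk statements can be checked numerically. The sketch below, which is ours and uses hypothetical function names, evaluates the unbiased risk estimate $V(k - (k-2)\hat{B}_{JS})$ for the Table 1 data and then estimates the true risk at $\mu = 0$ by Monte Carlo, where theory gives $2V$. With the unrounded shrinkage factor the risk estimate is about 4.49; the value 4.496 in the text uses the rounded factor 0.688.

```python
import random

y = [-2.15, -0.34, -0.08, 0.01, 0.08, 0.57, 0.61, 0.86, 1.11, 2.05]
V, k = 1.0, len(y)

def risk_estimate(y, V):
    """Unbiased estimate V * (k - (k - 2) * B_js) of the JS total risk."""
    k = len(y)
    S = sum(v * v for v in y) / V
    return V * (k - (k - 2) * (k - 2) / S)

def mc_risk_at_zero(k, V=1.0, reps=20000, seed=7):
    """Monte Carlo total squared-error risk of JS (toward 0) when mu = 0."""
    rng = random.Random(seed)
    sd = V ** 0.5
    total = 0.0
    for _ in range(reps):
        ys = [rng.gauss(0.0, sd) for _ in range(k)]
        S = sum(v * v for v in ys) / V
        B = (k - 2) / S
        total += sum(((1 - B) * v) ** 2 for v in ys)   # true mu_i = 0
    return total / reps

r_hat = risk_estimate(y, V)
r_zero = mc_risk_at_zero(k, V)
print(round(r_hat, 2), round(r_zero, 2))
```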

The JS estimator can be extended in the equal variance setting to cover
more general situations. For example, as Stein and others showed, one
can shrink the $y_i$ toward the grand mean of the data,
$\bar{y} = \sum y_i / k$. With these hospital data this would shrink
toward $\bar{y} = 0.272$ (which differs by less than one standard error
of the overall average for the 10 hospitals from the assumed mean 0).
More generally, if along with each $y_i$ one collects a vector $x_i$ of
$r \ge 0$ covariates (possibly including the intercept), each $y_i$ can
be shrunk toward its regression prediction $x_i' b$, where

\[ b = (X'X)^{-1} X'y \]

and $X$ is the $k \times r$ covariate matrix with rows $x_i'$,
$i = 1, \ldots, k$. Doing this forfeits $r$ degrees of freedom, so that

\[ \hat{B}_{JS} = (k - r - 2)/S, \]

with $S$ now replaced by

\[ S = \sum_{i=1}^{k} (y_i - x_i' b)^2 / V. \]

The James-Stein estimates of the $\mu_i$ then become

\[ \hat{\mu}_{JS,i} = (1 - \hat{B}_{JS}) y_i + \hat{B}_{JS} x_i' b
   = (1 - \hat{B}_{JS})(y_i - x_i' b) + x_i' b. \]

Writing $\hat{\mu}_{JS,i}$ this way suggests that shrinking with
$r > 0$ does not affect the $r$-dimensional regression space, but only
shrinks toward 0 in the $k - r$ dimensional space orthogonal to it.
Indeed, the problem can be "rotated" to an equivalent one in which the
last $r$ values of the residuals $y_i - x_i' b$ are all equal to 0,
regardless of the value of $y$; see, for example, Stein [24]. The
example just considered, with shrinkage toward zero, shows what happens
to the residuals when shrinkage is toward a regression model.
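The simplest regression case is shrinkage toward the grand mean, where $r = 1$ and the only covariate is the intercept. The sketch below applies that special case to the Table 1 data; it is an illustration of ours (the function name `js_toward_grand_mean` is hypothetical) and does not reproduce a calculation reported in the text beyond $\bar{y} = 0.272$.

```python
y = [-2.15, -0.34, -0.08, 0.01, 0.08, 0.57, 0.61, 0.86, 1.11, 2.05]
V = 1.0

def js_toward_grand_mean(y, V):
    """James-Stein shrinkage toward the grand mean (r = 1, intercept only)."""
    k, r = len(y), 1
    if k - r - 2 <= 0:
        raise ValueError("need k > r + 2")
    ybar = sum(y) / k                           # fitted value x_i' b
    S = sum((v - ybar) ** 2 for v in y) / V     # residual sum of squares / V
    B = (k - r - 2) / S                         # one degree of freedom lost
    est = [(1 - B) * (v - ybar) + ybar for v in y]
    return ybar, S, B, est

ybar, S_res, B_res, mu_res = js_toward_grand_mean(y, V)
print(round(ybar, 3), round(S_res, 2), round(B_res, 3))
```

Only the residuals $y_i - \bar{y}$ are shrunk, so the average of the estimates stays equal to $\bar{y}$.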

Of course $V$ needn't be 1.00, or even be known, provided there exists
an independent Chi-square estimate of $V$. While that can be handled
straightforwardly in the equal variance case [24], it will not be a
central issue in any case if the degrees of freedom are substantial.

Using the JS estimator seems easy and powerful, 
but many complicating issues arise in practice: 

1. What is the standard error of each individual estimate? One hopes
the JS estimator for Hospital 10 improves $y_{10} = 2.05$ (with
standard deviation = 1.00) by using the better estimate
$\hat{\mu}_{10} = 0.64$. The sum of individual risks has decreased
from 10 to 4.5 for all 10 hospitals, but this does not mean the
variance for each individual estimate has dropped to 0.45. Furthermore,
the JS estimator cannot even guarantee that every component (hospital)
has a smaller risk (expected squared error) as a function of $\mu$.
Such an improvement is impossible because each individual $y_i$ is an
admissible estimate of its own $\mu_i$, in one dimension. Rather,
minimaxity of the JS estimator for sum of squared errors is
accomplished by "balancing" or "trading off" component risks.
Components with mean square errors that exceed $V$ are guaranteed to
have their risks more than offset by risk improvements on the remaining
components. The minimaxity claim (improvement on the unshrunken vector
of unbiased estimates) is for aggregate risk, and not for every
component.

2. Why, even in this equal variance case, should the loss function be
an unweighted sum of squares? In applications the loss function could
require different relative weights to reflect unequal economic loss for
the mean squared errors of different components (hospitals, here). That
is, the appropriate loss function could be

\[ L(\mu, \hat{\mu}) = \sum_{i=1}^{k} W_i (\hat{\mu}_i - \mu_i)^2 \]

for some appropriate weights $W_1, \ldots, W_k > 0$.

Users of the James-Stein estimator typically assume that all $W_i$ are
equal in assessing its risk benefits. But would NY hospital
administrators agree that hospital errors can be traded off with equal
weights? Perhaps weights should differ for teaching hospitals, or for
military hospitals, or for children's or other specialty hospitals, or
for hospitals in areas far from medical centers, or for large
hospitals. Getting agreement on that issue has been a difficulty in
various real shrinkage applications. Even if the administrators could
agree on the values of the $W_i$, the James-Stein estimator would not
dominate $y$ when the $W_i$ are sufficiently unequal. There is a way
out that seems reassuring, at first, because a shrinkage estimator can
be found to dominate $y$ for any given weights $W_i$. But there is a
rub. The dominating estimator for a set of weights depends on the
specified weights, and then it cannot be expected to dominate $y$ for a
different set of weights. Only the unshrunken estimator $y$ can be
guaranteed to be minimax independently of the weights $W_i$. Its risk,
the minimax risk, is $V \sum_{i=1}^{k} W_i$. More on this in Section 3.

3. Even with equal weights, $W_i = 1$, another problem arising in
practice and in the theory is that $\hat{B}_{JS}$ can exceed 1. A
(uniformly) better shrinkage constant uses $\min(1, \hat{B}_{JS})$
instead and easily is seen to reduce the total risk. That change
necessitates developing a new unbiased estimator of risk. This was made
possible, and easy, by a simple calculus pioneered by Stein [25, 27]
and independently by Berger's integration by parts technique [2].

This truncated shrinkage estimator's improvement shows that the
James-Stein estimator is inadmissible itself. The improved truncated
estimator also is inadmissible, as it has a discontinuous derivative,
while admissible estimators must have all their derivatives (as a
function of the data). The search for admissible estimators began soon
after the James-Stein estimator, for example, Stein [14] and Brown [4].

4. We already have noted that there is no agreed-upon way to estimate
the component variances of the JS estimator. Correspondingly, there is
no way to determine separate confidence intervals for each $\mu_i$.
Confidence ellipsoids, for example, Stein [26] and Brown [5], can be
and have been developed for the equal variance setting. However,
ellipsoids may be unattractive to a data analyst who has the
alternative of estimating with $y_i$ and using $V^{1/2}$ as the
standard error, with a corresponding exact confidence interval for each
component obtained via the Normal distribution. Unfortunately, only
aggregates (ellipsoidal sets in this context) can provide uniformly
better coverage if coverage must hold conditionally on the underlying
$\mu$ for all $\mu$, that is, with Model-I evaluations. There is no
agreed upon component-wise procedure for standard errors and intervals
for individual components $\mu_i$ simply because no such procedure is
possible as a function of $\mu$. This problem (and others too) can be
rectified only via acceptance of a two-level, random effects model
referred to here as Model-II.
5. The overriding difficulty for the JS estimator as a practical tool
for data analysts is that, except for data produced by carefully
designed experiments, real data rarely occur with equal variances
$V_i = V$. Even the hospital data of Table 1 do not have exactly the
same variances. The first author has participated in developing and in
using shrinkage techniques for hospital profiling and for other
applications (e.g., [7, 17]) without ever seeing hospital or medical
data with equal variances, simply because hospital caseloads (numbers
of patients) vary considerably. For this initial discussion to
illustrate the JS estimator and related shrinkage procedures in the
equal variances setting, we have picked 10 of the 31 hospitals (the 31
to be described later) that had similar variances. These 10 each have
sample sizes within 15% of 550 patients.
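Issue 3 above, that $\hat{B}_{JS}$ can exceed 1 and that truncating it at 1 lowers the total risk, is easy to see by simulation. The following sketch is ours: it assumes equal weights, $V = 1$, shrinkage toward 0, and a hypothetical true mean vector $\mu = 0$ (where over-shrinkage is most frequent); the function name `js_losses` and the seed are arbitrary.

```python
import random

def js_losses(mu, V=1.0, reps=20000, seed=11):
    """Average total squared error of plain and truncated JS (toward 0)."""
    rng = random.Random(seed)
    k, sd = len(mu), V ** 0.5
    plain = trunc = 0.0
    for _ in range(reps):
        ys = [rng.gauss(m, sd) for m in mu]
        S = sum(v * v for v in ys) / V
        B = (k - 2) / S
        B_plus = min(1.0, B)                 # truncated shrinkage factor
        plain += sum(((1 - B) * v - m) ** 2 for v, m in zip(ys, mu))
        trunc += sum(((1 - B_plus) * v - m) ** 2 for v, m in zip(ys, mu))
    return plain / reps, trunc / reps

risk_js, risk_trunc = js_losses([0.0] * 10)
print(round(risk_js, 2), round(risk_trunc, 2))
```

Both estimators use the same simulated data sets, so the comparison between the two averages is not blurred by independent Monte Carlo noise.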

2. THEORETICAL AND BAYESIAN DEVELOPMENTS FOR THE EQUAL VARIANCE CASE

This section reviews expansion of the assumptions of "Model-I" to a
two-level model, "Model-II," which at Level-II includes a random
effects model on the $\mu_i$, with the Level-II parameters unknown but
estimable from the data. Model-II and Stein's harmonic prior (SHP), to
be introduced in this section, will be especially important as a basis
for developing frequency procedures in the difficult unequal variances
situation of Section 3. After briefly introducing the unequal variances
case in this section, the equal variances setting is studied because of
its relatively easy calculations. This enables Bayesian analysis that
uses formal priors on the Level-II parameters that produce shrinkage
estimators as posterior means. In the equal variance setting, many of
these estimators have been proven to be minimax (some also are
admissible) in the original Model-I sense of Stein, that is, for total
square error loss and for every possible mean vector $\mu$. The
centerpiece of this section is Figure 2, which displays graphically
certain famous distributions on the Level-II variance, "$A$," that are
known to provide minimax shrinkage estimators. Of central importance is
Stein's harmonic prior (SHP) on $\mu$, which stems from imposing an
improper flat prior on $A$ and yields an admissible, minimax
modification of the James-Stein estimator. This SHP shrinkage estimator
leads to posterior interval estimates that meet confidence requirements
for coverages in Model-II evaluations.

Fig. 2. Classification of the proper and formal priors of the form given in (2). The $u$-axis determines limits on minimaxity,
with smaller values providing less shrinkage. Larger $k_0$ indicates more prior information, and thus more shrinkage.

A generalization of the James-Stein estimator almost always is required
in practice because the unequal variance situation arises, and also
because data analysts often must provide interval estimates. Uniform
risk dominance as a function of $\mu$ will be seen in Section 3 to
require inappropriate (reversed) shrinkage patterns in practice. Of
course, shrinkage methods are used commonly in applications, almost
always being based on a two-level random effects model, the $\mu_i$
being random effects with their own distributions. Such models belong
to frequentists and Bayesians alike, known as hierarchical models,
multilevel models, empirical Bayes models and by other terms. Table 2
shows one such model with Normally distributed observations (Level-I),
and Normally distributed random effects (Level-II). The two columns,
that is, the Descriptive and the Inferential versions of the model, are
equivalent in that both sides give rise to the same joint distributions
of the data and the random effects, $(y, \mu)$, given the
hyperparameter $\alpha$ that governs the joint distribution. These
models allow "unequal variances" $V_i$, perhaps because
$V_i = \sigma^2/n_i$ with different sample sizes. That anticipates
Section 3, but in this "equal variances" section we always assume
$V_i = V$.

In what follows, Model-I will refer to the distribution of
$y \mid \mu$ at Level-I of Table 2 which treats $\mu$ as the unknown
parameter, whereas Model-II will refer to the random effects model
combining Model-I and the Level-II distribution of $\mu \mid \alpha$,
which has unknown parameter $\alpha = (\beta, A)$. Model-III will refer
to the fully Bayesian model embracing all Levels I, II and III for a
single prior $\pi(\alpha)$ on $\alpha$, and is used here primarily to
construct Bayes rules to be evaluated via the assumptions of Model-I or
Model-II, in the frequency sense for all y.

Table 2
Multilevel model layout

Level   Descriptive version                                              Inferential version
I       $y_i \mid \mu_i \sim \mathcal{N}(\mu_i, V_i)$, $i = 1, \ldots, k$         $y_i \mid \alpha \sim \mathcal{N}(x_i'\beta, V_i + A)$
II      $\mu_i \mid \alpha \sim \mathcal{N}(x_i'\beta, A)$                          $\mu_i \mid y, \alpha \sim \mathcal{N}((1 - B_i) y_i + B_i x_i'\beta, V_i(1 - B_i))$
III     $\alpha \sim \pi(\alpha)$

All distributions in Levels I and II are independent across $i$.

If the hyperparameter $\alpha = (\beta, A)$ were known in Model-II of
Table 2, one would use the Level-II distribution of
$\mu \mid y, \alpha$ in Table 2 to make inferences about each component
value $\mu_i$. For squared error loss the best estimator of $\mu_i$
then would be the posterior mean, which estimates $\mu_i$ by using the
shrinkage factor

\[ B_i = \frac{V_i}{V_i + A} \]

to compromise between the prior mean $x_i'\beta$ and the sample mean
$y_i$.
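With $\alpha$ known, the Level-II inferential distribution of Table 2 is a one-line computation per component. The sketch below is a minimal illustration of ours: the function name `posterior_summary` and the numerical inputs are hypothetical, and `prior_mean[i]` stands in for $x_i'\beta$.

```python
def posterior_summary(y, V, A, prior_mean):
    """Posterior means and variances of mu_i given y and known (beta, A).

    y, V and prior_mean are equal-length lists; prior_mean[i] plays the
    role of x_i' beta, and A > 0 is the Level-II variance.
    """
    if A <= 0:
        raise ValueError("A must be positive")
    out = []
    for yi, Vi, mi in zip(y, V, prior_mean):
        Bi = Vi / (Vi + A)                     # shrinkage factor B_i
        out.append(((1 - Bi) * yi + Bi * mi,   # posterior mean
                    Vi * (1 - Bi)))            # posterior variance
    return out

# Hypothetical inputs: two components with unequal variances and A = 1.
summary = posterior_summary(y=[2.0, 2.0], V=[1.0, 4.0], A=1.0,
                            prior_mean=[0.0, 0.0])
print(summary)
```

In this toy input the noisier component ($V_i = 4$) is shrunk more ($B_i = 0.8$) than the accurate one ($B_i = 0.5$), which is the shrinkage pattern Model-II regards as appropriate.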

While shrinkages needn't arise for many distributions that one could
choose for Level-II, they do with the Normal distribution on $\mu_i$
because the Normal distribution on $\mu$ in Level-II is conjugate to
the Normal Level-I likelihood. Conjugate priors at Level-II lead to
linear posterior means and shrinkage coefficients for the Normal and
for other exponential family models too; see, for example, Diaconis and
Ylvisaker [8] and Morris and Lock [21]. They also are the "G2 minimax"
choice for Level-II [13, 18].

With $k > r + 2$ components, it is not required to assume
$\alpha = (\beta, A)$ is known because information builds up through
the $k$ observations $y_i$, whose distributions are governed by their
shared dependence on $(\beta, A)$ via the likelihood function given by
the right half of Level-I in Table 2. For the rest of this section we
focus on the simplest case of the Table 2 model with $\beta = 0$
($r = 0$). Thus, $\alpha = A$ is the only unknown hyperparameter. With
equal variances, studying the case $\beta = 0$ is much less restrictive
than it might seem because use of the orthogonality trick described in
Section 1 allows developments for $\beta = 0$ to be extended back to
the case with $\beta$ unknown.

Early work on the equal variance case strongly emphasized Model-I
squared error evaluations made conditionally on $\mu$. Even so, it was
realized, for example, Stein [24], that if one also assumes Model-II,
then it is easy to motivate shrinkage estimators and the JS estimator,
since one can estimate $A$ by considering the likelihood of $A$, or,
equivalently, of $B_i = B$. The likelihood of $B$ follows from the
marginal distribution of $y \mid B$ in the inferential column of
Table 2 which has the form of a Gamma density, but conditioned on
$B < 1$,

\[ L(B) = B^{k/2} \exp(-BS/2). \]

Because of the equal variance assumption, $L(B)$ only depends on the
1-dimensional sufficient statistic for $B$ in the model for
$y \mid A$:

\[ S = \sum_{i=1}^{k} y_i^2 / V. \]

The maximum likelihood estimate of $B$ is $\hat{B} = k/S$. However, $B$
(not $A$) enters linearly in $E[\mu_i \mid y]$, and by noting that

\[ S \mid B \sim \frac{1}{B} \chi^2_{(k)}, \]

one sees that the James-Stein shrinkage estimate
$\hat{B}_{JS} = (k - 2)/S$ is the best unbiased estimate of $B$. Both
of these estimates lead to shrinkage or "empirical Bayes" estimators of
$\mu_i$ via substituting $k/S$ or $(k - 2)/S$ for the shrinkage $B$,
where $B$ appears in

\[ E[\mu_i \mid y, B] = (1 - B) y_i. \]
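The two estimates of $B$ and the unbiasedness of $(k - 2)/S$ can be checked with a short simulation. The sketch below is ours and rests on stated assumptions: it draws $S$ as $\chi^2_{(k)}/B$ for a hypothetical true value $B = 0.5$, and it caps the maximum likelihood estimate at 1 to respect the constraint on $B$, a truncation that is our addition. All function names are hypothetical.

```python
import math
import random

def log_lik_B(B, k, S):
    """Log of L(B) = B^(k/2) * exp(-B*S/2), for 0 < B <= 1."""
    return 0.5 * k * math.log(B) - 0.5 * B * S

def shrinkage_estimates(k, S):
    """Return (MLE capped at 1, James-Stein estimate) of B."""
    return min(1.0, k / S), (k - 2) / S

def mean_js_factor(B_true, k, reps=20000, seed=3):
    """Monte Carlo mean of (k - 2)/S when S ~ chi2_k / B_true."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        total += (k - 2) / (chi2 / B_true)
    return total / reps

B_mle, B_js = shrinkage_estimates(10, 11.62)
avg = mean_js_factor(B_true=0.5, k=10)
print(round(B_mle, 3), round(B_js, 3), round(avg, 3))
```

The simulated average of $(k - 2)/S$ should sit close to the assumed $B = 0.5$, while $k/S$ is larger than $(k-2)/S$ for every data set.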

Minimaxity of these and of other shrinkage estimators can be checked
via Baranchik's minimax theorem, from his 1964 dissertation [1] under
Stein. Assume the equal variance Normal setting of Table 2, $r = 0$,
$k \ge 3$, and Model-I only. Suppose an estimator shrinks its $k$
components toward 0 based on a shrinkage factor of the form

\[ \hat{B}(S) = u(S)/S, \]

with $u(S)$ nondecreasing and with $0 \le u(S) \le 2(k - 2)$. Then the
estimator is minimax for total mean squared error risk under Model-I,
that is, with risk at most $kV$ for all $\mu$. A similar but more
general condition for minimaxity that lets $u(S)$ be decreasing also
exists [10]. These minimaxity conditions easily extend to include
shrinkage toward a fitted $r > 0$ dimensional subspace by making $S$ be
the residual sum of squares and by accounting for the loss of $r$
degrees of freedom, so then $0 \le u(S) \le 2(k - r - 2)$ is required.
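Baranchik's conditions are simple enough to screen numerically for a candidate $u(S)$. The sketch below checks them on a finite grid of $S$ values; this is only a numerical screen under our own assumptions (grid range, tolerance, and the function name `satisfies_baranchik` are all hypothetical), not a proof of minimaxity.

```python
def satisfies_baranchik(u, k, r=0, grid=None, tol=1e-12):
    """Check Baranchik's sufficient conditions for u(S) on a finite grid.

    Returns True when u is nondecreasing on the grid and stays within
    [0, 2(k - r - 2)]. A grid check is a screen, not a proof.
    """
    if grid is None:
        grid = [0.01 * j for j in range(1, 20001)]   # S in (0, 200]
    upper = 2 * (k - r - 2)
    values = [u(s) for s in grid]
    bounded = all(-tol <= v <= upper + tol for v in values)
    monotone = all(b >= a - tol for a, b in zip(values, values[1:]))
    return bounded and monotone

k = 10
ok_js = satisfies_baranchik(lambda S: k - 2, k)           # James-Stein
ok_trunc = satisfies_baranchik(lambda S: min(S, k - 2), k)  # factor min(1, (k-2)/S)
ok_big = satisfies_baranchik(lambda S: 3 * (k - 2), k)    # shrinks too much
print(ok_js, ok_trunc, ok_big)
```

The James-Stein choice $u(S) = k - 2$ and the truncated choice $u(S) = \min(S, k - 2)$ both pass, while a constant above $2(k - 2)$ fails the upper bound.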

2.1 Bayes and Formal Bayes Rules 

The model of Table 2 can be expanded to Level-III to allow Bayesian and
formal Bayesian inferences by assuming that $\alpha$ in general (in our
simplified context, the unknown variance parameter $A$) has a proper or
improper prior distribution. Shrinkage factors are then determined as
integrals over the posterior distribution of $B$,

\[ E[B \mid S] = \frac{\int_0^1 B\, L(B)\, \pi(B)\, dB}
                      {\int_0^1 L(B)\, \pi(B)\, dB} \]

for some prior density $\pi$ on $B$.
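The ratio of integrals defining $E[B \mid S]$ can be evaluated by ordinary quadrature. The sketch below uses composite Simpson integration with the likelihood $L(B) = B^{k/2}\exp(-BS/2)$ and, as an example of ours, a Uniform(0, 1) prior on $B$. It assumes the prior density is finite on the closed interval, so it does not cover priors that are unbounded at $B = 0$ or $B = 1$; the function names are hypothetical.

```python
import math

def posterior_mean_B(S, k, prior, n=20000):
    """E[B | S] by composite Simpson integration over 0 < B < 1.

    `prior` is a density on B, known up to a constant and finite on
    [0, 1]; the likelihood is B^(k/2) * exp(-B*S/2). n must be even.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for Simpson's rule")
    h = 1.0 / n
    num = den = 0.0
    for j in range(1, n):                  # integrand is 0 at B = 0
        B = j * h
        w = 4.0 if j % 2 else 2.0          # Simpson interior weights
        f = B ** (k / 2.0) * math.exp(-B * S / 2.0) * prior(B)
        num += w * B * f
        den += w * f
    f1 = math.exp(-S / 2.0) * prior(1.0)   # endpoint B = 1, weight 1
    num += f1
    den += f1
    return num / den                       # common factor h/3 cancels

def uniform_prior(B):
    """Uniform(0, 1) prior density on B."""
    return 1.0

b_hat = posterior_mean_B(11.62, 10, uniform_prior)
scaled = 400.0 * posterior_mean_B(400.0, 10, uniform_prior)
print(round(b_hat, 3), round(scaled, 2))
```

The second number illustrates the large-$S$ behaviour of the posterior mean: for this prior and $k = 10$, $S \times E[B \mid S]$ settles near 12.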

Two obvious families of priors arise in this context, to be charted in
Figure 2:

1. Scale-invariant priors on $A$. Indexed by constants $c \ge 0$, these
are improper (i.e., not finitely integrable) formal priors, with
differential elements

\[ A^{c/2}\, dA/A, \qquad A > 0. \]

As a distribution on $B$, this corresponds to

\[ B^{-c/2-1} (1 - B)^{c/2-1}\, dB, \qquad 0 < B < 1. \]

These have the form of Beta densities, but they do not integrate
finitely. Only propriety of the posterior distribution is required,
that is, after multiplication by $L(B)$, which imposes the additional
restriction $0 < c < k$.

2. Conjugate priors on $B$ take the form of the likelihood function
$L(B)$, but with different values of $k$, $S$. We index this conjugate
family by $k_0 \ge 2$ and by $S_0 \ge 0$, perhaps thinking of them as
previous values of $k$ and $S$. Posterior propriety now requires that
$k_0$ satisfy $k_0 + k > 0$. The prior and posterior densities take the
same form as $L(B)$, having differential element

\[ B^{(k_0-2)/2} \exp(-B S_0/2)\, dB/B, \qquad 0 < B < 1. \]

If $S_0 > 0$, these are "truncated" $\chi^2_{(k_0-2)}$ distributions on
$B < 1$, scaled by $S_0$. This second family involves proper priors if
$k_0 > 2$, known as "Strawderman's priors" [28] when $S_0 = 0$.
Strawderman showed (via Baranchik's theorem) that the posterior mean of
$B$ for these priors provides minimax and admissible shrinkage
estimators if $k_0 \le k - 2$ (so $k \ge 5$ is required). These
properties also hold if $S_0 > 0$. When $S_0 = 0$, $B$ has a
Beta$((k_0 - 2)/2, 1)$ distribution and

\[ EB = (k_0 - 2)/k_0 \]

a priori, again requiring $k_0 > 2$ for propriety.
$EB \le (k - 4)/(k - 2)$ is the upper limit for minimaxity, requiring
$k \ge 5$. The special choice $k_0 = 4$ puts a Uniform(0, 1) prior
distribution on $B$ and minimaxity then requires $k \ge 6$. Derived
from proper priors, the posterior mean of $\mu$, given the data $y$ for
any of these Strawderman priors, automatically qualifies as an
admissible, minimax estimator in the Model-I sense for quadratic loss.

The densities of these two prior families can be combined by
multiplication (and some reparametrization) to yield a 3-parameter
family with densities on $B$ of the form

\[ p(B \mid k_0, c, S_0) \propto B^{(k_0-c)/2-1} (1 - B)^{c/2-1}
   \exp(-B S_0/2)\, dB, \qquad 0 < B < 1. \tag{1} \]

If $S_0 = 0$, this class of prior densities has the form

\[ \mathrm{Beta}\bigl(\tfrac{1}{2}(u - k),\ \tfrac{1}{2}(k + k_0 - u)\bigr)
   \tag{2} \]

with $u = k + k_0 - c$. They are proper only if $k_0 > c > 0$, that is,
if $k_0 > u - k > 0$. The posterior density is proper if and only if
$k_0 + k > c > 0$ since the exponential term that also appears in the
posterior density, $\exp(-B(S + S_0)/2)$, is bounded in $B$ so that
term cannot affect posterior propriety.
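The propriety conditions for family (2) translate directly into a small classifier in the $(u, k_0)$ coordinates. The sketch below is ours: the function name `classify_prior` and the dictionary keys are hypothetical, and the minimaxity flag only records Baranchik's range $0 \le u \le 2(k - 2)$, which by itself does not establish minimaxity.

```python
def classify_prior(u, k0, k):
    """Classify the S0 = 0 prior Beta((u - k)/2, (k + k0 - u)/2) on B.

    Uses c = k + k0 - u. Returns flags for prior propriety, posterior
    propriety, and whether u lies in the range 0 <= u <= 2(k - 2).
    """
    c = k + k0 - u
    return {
        "c": c,
        "prior_proper": k0 > c > 0,
        "posterior_proper": k0 + k > c > 0,
        "u_in_baranchik_range": 0 <= u <= 2 * (k - 2),
    }

k = 10
flat_on_A = classify_prior(u=k - 2, k0=0, k=k)   # c = 2: flat prior dA
uniform_B = classify_prior(u=k + 2, k0=4, k=k)   # c = 2: Uniform(0, 1) on B
dA_over_A = classify_prior(u=k, k0=0, k=k)       # c = 0: dA/A
print(flat_on_A, uniform_B, dA_over_A, sep="\n")
```

With $k = 10$, the flat prior on $A$ is improper with a proper posterior, the Uniform(0, 1) prior on $B$ is proper, and the choice $c = 0$ fails posterior propriety.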

Figure 2 shows the key regions for this formal Beta family (2) of prior
densities (scaled for $k = 10$) in terms of the two parameters
$(u, k_0)$, ignoring the nearly irrelevant $S_0$, here set to 0. It
emphasizes regions where minimaxity holds, $0 \le u \le 2(k - 2)$.
Instead of $c$, the horizontal axis uses $u = k + k_0 - c$, because $u$
determines minimaxity. It can be seen that as $S \to \infty$,
$S \times E[B \mid S] \to u$ for these priors, and Baranchik's theorem
tells us that minimaxity for large $S$ fails unless
$0 \le u \le 2(k - 2)$. This condition is necessary for minimaxity, but
not sufficient.

Some explanation is in order, as follows in (a) through (h):

(a) Priors on $B = V/(V + A)$ that lead to minimax estimators are
limited to $0 \le u \le 2(k - 2)$.

(b) The posterior distribution is proper only if $k_0 > u - k$. The 45
degree line $k_0 = u - k$ in Figure 2 marks the (unattainable) lower
bound for these priors.

(c) Proper priors require $k_0 > c$, so that $u > k$. Proper priors lie
in the darkly shaded region to the right of the vertical line $u = k$.
Improper priors are those with $u \le k$.

(d) The scale-invariant priors $A^{c/2}\, dA/A$ on $A > 0$ are on the
horizontal axis $k_0 = 0$, so $c = k - u$ in these priors. Posterior
propriety for these priors, as seen in (b), requires $0 < u < k$ so
that shrinkage cannot extend all the way to the Baranchik limit
$2(k - 2)$. Scale invariant priors cannot be proper.

Viewed as distributions on $\mu$, these scale-invariant priors have
differential elements

\[ d\mu / \|\mu\|^{u} \]

[by integrating the $\mathcal{N}(0, A I_k)$ density with respect to
$A^{c/2}\, dA/A$, and using $c = k - u$]. If $u = 0$, that is, the
prior located in Figure 2 at $(u, k_0) = (0, 0)$, we have the
$k$-dimensional Lebesgue measure $d\mu$, which leads to using $y$ as
the estimator of $\mu$, that is, no shrinkage.

One never should use $(u, k_0) = (k, 0)$, although researchers
sometimes make this mistake, thinking that the prior is vague because
this is Jeffreys' form $dA/A$ in other contexts. Actually, this prior
forces $B = 1$ a posteriori, no matter what the magnitude of $S$ might
be. Obviously this full-shrinkage estimator cannot be minimax.

(e) The conjugate priors $B^{(k_0-4)/2}\,dB$ on $0 < B < 1$ (setting $S_0 = 0$) form the upsloping line $k_0 = u - (k-2)$. These have proper posteriors because this line lies above (and is parallel to) the line $k_0 = u - k$. They are proper if $u > k$, that is, $k_0 > 2$, being Strawderman's priors.

All these conjugate priors produce an easily calculated shrinkage factor ($u$ need not be an integer in the Chi-squares) in this equal variances setting:

\[ \hat B \equiv E[B \mid S] = \frac{u}{S + S_0} \cdot \frac{P[\chi^2_{u+2} \le S + S_0]}{P[\chi^2_{u} \le S + S_0]}. \]

In this expression, $S\hat B$ is monotone increasing in $S$ because the ratio of the $\chi^2_{(u+2)}$ and $\chi^2_{(u)}$ densities is monotone increasing. Therefore, Baranchik's theorem applies and verifies minimaxity.

(f) The vertical line at $u = k - 2$ denotes priors that have the smallest Model-I risks as $\|\mu\| \to \infty$. This holds because all priors in Figure 2 have shrinkages $E[B \mid S]$ near to $\hat B = u/S$ for large $S$, and this must occur when $\|\mu\|$ is large.

On the other hand, the mean-squared-error risk for shrinkage estimators of the form $u/S = a\hat B_{JS}$, with $a = u/(k-2)$ for any $0 \le a \le 2$, is

\[ E\Bigl[\sum_{i=1}^{k} \frac{((1 - u/S)\,y_i - \mu_i)^2}{V} \Bigm| \mu\Bigr] = k - (k-2)\,a(2-a)\,E[\hat B_{JS}]. \]

This risk is minimized uniformly at $a = 1$, showing that the James-Stein estimator is optimal among estimators of the form $u/S$. Combining these two facts shows that minimax priors with $u = k - 2$ lead to estimators with risk functions that, for large $\|\mu\|$, will be smaller than those in Figure 2 with $u \ne k - 2$.

(g) Admissibility of the resulting Bayes estimators of $\mu$ holds immediately for proper priors, so the priors in the rightmost wedge with $k < u \le 2(k-2)$ provide admissible minimax estimators.

Improper priors may or may not produce admissible estimators. Various estimators based on priors with $k - 2 \le u \le k$ are admissible and minimax at least if $k_0$ isn't too small. The SHP prior, which corresponds to $(u, k_0) = (k-2, 0)$, that is, $dA$, is an improper prior that does yield an admissible estimator. More on this later.

(h) Inadmissibility holds for many (perhaps all) of the priors with $u < k - 2$. That this holds is suggested by the fact that the risk of an estimator with $u < k - 2$ can be lowered for large $\|\mu\|$ by using a prior with $u = k - 2$ [as argued in (f) above]. Then it seems likely that such a prior can be found on the $u = k - 2$ vertical axis of Figure 2 that would increase shrinkage (shrinkage generally increases in the rightward direction on Figure 2) with lower risk everywhere as a function of $\|\mu\|$.
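The risk comparison in item (f) can be checked numerically. The following is a minimal sketch, not the authors' computation: it assumes the risk is expressed in units of $V$, writes `lam` for $\|\mu\|^2/V$ (a name introduced here), and evaluates $E[1/S]$ for a noncentral Chi-square $S$ through the standard Poisson-mixture identity. The helper names are hypothetical.

```python
import math

def expected_inverse_chisq(k, lam, terms=400):
    """E[1/S] for S ~ noncentral chi-square with k > 2 df and noncentrality lam.

    Standard Poisson-mixture identity: given J ~ Poisson(lam / 2), S is a
    central chi-square with k + 2J df, and E[1 / chi2_m] = 1 / (m - 2).
    """
    half = lam / 2.0
    total = 0.0
    for j in range(terms):
        if half > 0:
            # Poisson weight computed on the log scale to avoid overflow.
            weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        else:
            weight = 1.0 if j == 0 else 0.0
        total += weight / (k - 2 + 2 * j)
    return total

def risk(a, k, lam):
    """Risk, in units of V, of (1 - a (k - 2) / S) y when ||mu||^2 / V = lam."""
    expected_b_js = (k - 2) * expected_inverse_chisq(k, lam)
    return k - (k - 2) * a * (2 - a) * expected_b_js

k, lam = 10, 25.0  # illustrative values, not taken from the text
risks = {a: risk(a, k, lam) for a in (0.0, 0.5, 1.0, 1.5, 2.0)}
for a, r in risks.items():
    print(f"a = {a:.1f}: risk / V = {r:.3f}")
```

Because the risk depends on $a$ only through $a(2-a)$, the values at $a = 0.5$ and $a = 1.5$ coincide, and $a = 1$ (James-Stein) gives the smallest risk at every `lam`.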

Early after it was recognized that the estimator $y$ could be uniformly improved upon, numerous authors proposed priors captured by Figure 2, motivated by Bayesian and/or admissibility concerns. Many of these were scale-invariant priors with $k_0 = 0$, especially with $k - 2 \le u \le k - 1$, for example, Stein, K. Alam, T. Leonard, I. J. Good and D. Wallace, D. Rubin, D. V. Lindley and A. F. M. Smith. Others were proposed on the conjugacy line $k_0 = u - (k-2)$, including $dB/B$, that is, $(u, k_0) = (k-1, 1)$, which has Jeffreys' form, and (being improper) falls at the edge of Strawderman's priors. Various authors since have repeated these and other suggestions, partly as "reference priors." Our hope is that these priors that decision theory has shown to lead to the best and most trustworthy estimators for the equal variances setting of Figure 2 are "transportable" to the unequal variances setting.

Charles Stein's choice is a prior on $\mu$, not on $A$. "Stein's harmonic prior," SHP, corresponds to $\mu$ having a measure that stems from $c = 2$, that is, $A$ with a flat density. It is

\[ p(\mu)\,d\mu \propto d\mu / \|\mu\|^{k-2}. \]

By (d), this corresponds to $(u, k_0) = (k-2, 0)$ in Figure 2. The term "harmonic" refers to the fact that the Laplacian of the prior, $\nabla^2 p(\mu)$, is uniformly equal to 0, except at the origin where it fails to exist. Technically, since $\nabla^2 p(0) = -\infty$, the prior is actually superharmonic (Laplacian less than or equal to 0), a term Stein himself employed when showing that the resulting Bayes rule was both admissible and minimax by Model-I standards [27]. However, the term "harmonic" is simpler, nearly correct, and used by most researchers.

One motivation for the SHP prior stems from an easy calculation that shows the James-Stein shrinkage coefficient satisfies $E[B \mid S] = (k-2)/S = \hat B_{JS}$ if one assumes the (absurd) prior that $A \sim \mathrm{Uniform}[-V, \infty)$ [19]. Of course, allowing $A < 0$ is illogical, and removing that part of the support for $A$ gives $A \sim \mathrm{Uniform}[0, \infty)$, which yields the SHP.

A second motivation is that taking $A$ uniform on $(0, \infty)$ lies uniquely in Figure 2 at the intersection of the scale-invariant priors ($k_0 = 0$) and the sloped line of conjugate priors [$k_0 = u - (k-2)$]. That is Stein's SHP. Indeed, the SHP sits on the "admissible boundary," being the scale-invariant admissible prior that shrinks least among the admissible ones. It is also optimal as $\|\mu\| \to \infty$ ($u = k - 2$). Being formal Bayes but not proper Bayes, it provides little prior information about $A$. Its conjugacy makes its shrinkages easy to compute in the equal variance setting.

A third motivation, as will be seen, is that the aggregate conditional posterior risk $R^* = R^*(S) \le kV$ for this prior, and, in turn, $R^*$ exceeds the unbiased estimate of the aggregate risk $\hat R(S)$, not shown, of the SHP estimator; see Morris [16, 19]. More on this momentarily.

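Using (3), the posterior mean of $B$ resulting from the SHP prior is

\[ \hat B_{SHP} \equiv E[B \mid y] = \frac{k-2}{S} \times \frac{P[\chi^2_{k} \le S]}{P[\chi^2_{k-2} \le S]}. \]

The posterior variance of $B$ [16, 19] is, for $k \ge 3$,

\[ v \equiv \operatorname{var}(B \mid y) = \frac{2}{k-2}\,\hat B_{SHP}^2 - (\hat B_{JS} - \hat B_{SHP})\Bigl(1 - \frac{k}{k-2}\,\hat B_{SHP}\Bigr). \]

For the $k = 10$ hospitals, $S = 11.62$ gives $\hat B_{JS} = 8/S = 0.688$, so we have $\hat B_{SHP} = 0.688 \times 0.829 = 0.571$ and $v = (0.218)^2$.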
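As a check on the two formulas above, the posterior moments of $B$ can also be computed by direct numerical integration. This is a sketch under stated assumptions: under the flat prior on $A$ the posterior kernel of $B$ on $(0, 1)$ is taken to be $B^{(k-4)/2} e^{-BS/2}$, the data are the ten $y_i$ of Table 3 with $V = 1$, and `shp_moments` is a hypothetical helper using a plain midpoint rule (adequate for $k \ge 4$).

```python
import math

def shp_moments(S, k, n=20000):
    """Posterior mean and variance of B under the SHP prior, equal variances.

    Integrates the kernel B**((k - 4) / 2) * exp(-B * S / 2) over 0 < B < 1
    with a midpoint rule on n cells.
    """
    h = 1.0 / n
    m0 = m1 = m2 = 0.0
    for j in range(n):
        b = (j + 0.5) * h
        w = b ** ((k - 4) / 2.0) * math.exp(-b * S / 2.0)
        m0 += w
        m1 += w * b
        m2 += w * b * b
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean

# The ten y_i of Table 3 (equal variances, V = 1).
y = [-2.15, -0.34, -0.08, 0.01, 0.08, 0.57, 0.61, 0.86, 1.11, 2.05]
V = 1.0
k = len(y)
S = sum(yi ** 2 for yi in y) / V
B_js = (k - 2) / S
B_shp, v = shp_moments(S, k)

# Closed-form variance from the display above, for comparison.
v_closed = 2 / (k - 2) * B_shp ** 2 - (B_js - B_shp) * (1 - k / (k - 2) * B_shp)

print(f"S = {S:.2f}, B_JS = {B_js:.3f}, B_SHP = {B_shp:.3f}")
print(f"sqrt(v) numeric = {math.sqrt(v):.3f}, closed form = {math.sqrt(v_closed):.3f}")
```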

From the SHP posterior mean $\hat B_{SHP}$ we obtain the formal Bayes rule for $\mu_i$,

\[ \hat\mu_{SHP,i} \equiv E[\mu_i \mid y] = (1 - \hat B_{SHP})\,y_i. \]

But what of interval estimates for $\mu_i$? Our Model-III construction via SHP suggests use of posterior probability intervals. For the SHP these can easily be approximated after computing the posterior variance of $\mu_i$, which for $r = 0$ is

\[ s_i^2 \equiv \operatorname{var}(\mu_i \mid y) = V(1 - \hat B_{SHP}) + v\,y_i^2. \]

Figure 3 from Morris and Tang [22] shows coverage rates of $\mu_i$ for 2-sided intervals with nominal coverage 95%. Each interval is centered at its SHP shrinkage estimate and approximates each of the $k$ posterior distributions as Normally distributed with interval widths determined by adding and subtracting $1.96\,s_i$.

Fig. 3. Exact coverage probabilities against true shrinkage factor $B = V/(V + A)$ for two equal variances rules, SHP (dark curve) and the ADM approximation to SHP (dotted curve), with nominal 95% coverages, for $r = 0$ and $k = 4, 10, 20$.

The true coverages in Figure 3 for SHP are never less than 94.5% for any value of $A$ for any of the three values of $k = 4, 10, 20$ shown. The coverage probabilities do not depend on $i$ or on $V$, so this is tantamount to a proof that this procedure comes close to providing or exceeding the nominal coverage. Over-coverages rise noticeably above 95% as the between-groups variance $A$ approaches 0, that is, as the shrinkage $B$ approaches 1. One must keep in mind, however, that while these intervals are based on the posterior mean and variance, they are not the posterior intervals because such are not symmetric. For example, the under-coverage by 0.5% for SHP when $k = 20$ does not account for the posterior skewness of the distributions of $\mu_i$, which is considerable when $y_i$ is one of the extreme observations.

Intervals based on a different estimator, deter- 
mined by an approximation technique called adjust- 
ment for density maximization (ADM) [20, 22], also 
are shown in Figure 3, having slightly better min- 
imum coverage. These estimators are described in 
the next section. 

Componentwise intervals better than those centered at $y_i$, that is, intervals that average being shorter than $2 \times 1.96\,sd_i$, do not exist for all $\mu$, by Model-I standards. Such may exist for all $A$, when averaging over $\mu \mid A$ in Model-II. Indeed, note that $s_i^2$ also can be interpreted as the Bayes risk of the SHP rule $\hat\mu_{SHP,i}$,

\[ R_i^* \equiv E[(\hat\mu_{SHP,i} - \mu_i)^2 \mid y] = s_i^2. \]

Let us contrast this with

\[ \hat R_i \equiv V(1 - 2\hat B_{SHP}) + y_i^2(\hat B_{SHP}^2 + 2v), \]

the unique unbiased estimate of the component risk of $\hat\mu_{SHP,i}$. That is,

\[ E[\hat R_i \mid \mu] = E[(\hat\mu_{SHP,i} - \mu_i)^2 \mid \mu] \]

is the Model-I component risk for any value of $\mu$. Letting $R^* \equiv \sum_{i=1}^{k} s_i^2$ and $\hat R \equiv \sum_{i=1}^{k} \hat R_i$, by rearranging terms [16, 17] one sees that

\[ \hat R \le R^* \le kV. \]
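The two risk estimates are simple enough to tabulate directly. The sketch below plugs the rounded values quoted in the text ($\hat B_{SHP} = 0.571$, $v = 0.218^2$, $V = 1$) into the formulas for $s_i^2$ and $\hat R_i$ for the ten hospitals; because the inputs are rounded, the totals agree with the reported ones only to about two decimals.

```python
import math

# The ten y_i of Table 3, with V = 1 and the rounded SHP summaries from the text.
y = [-2.15, -0.34, -0.08, 0.01, 0.08, 0.57, 0.61, 0.86, 1.11, 2.05]
V = 1.0
B_hat = 0.571      # posterior mean of B (rounded)
v = 0.218 ** 2     # posterior variance of B (rounded)

mu_hat = [(1 - B_hat) * yi for yi in y]                      # SHP estimates
R_star = [V * (1 - B_hat) + v * yi ** 2 for yi in y]         # Bayes risks s_i^2
R_hat = [V * (1 - 2 * B_hat) + yi ** 2 * (B_hat ** 2 + 2 * v) for yi in y]

R_star_total = sum(R_star)
R_hat_total = sum(R_hat)

for yi, m, rs, rh in zip(y, mu_hat, R_star, R_hat):
    print(f"{yi:6.2f} {m:6.2f} {math.sqrt(rs):5.2f} {rs:6.3f} {rh:7.3f}")
print(f"R* = {R_star_total:.2f}, R_hat = {R_hat_total:.2f}, kV = {len(y) * V:.0f}")
```

Several of the $\hat R_i$ come out negative for hospitals near the center, which is the awkward feature of the unbiased risk estimate noted in the text.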

That $R^* \le kV$ shows Model-I minimaxity of $\hat\mu_{SHP}$, since its risk is less than that of the minimax $y$. That $\hat R \le R^*$ shows that the SHP prior is so vague that its Bayes risk is more conservative than its frequency-based unbiased estimate of risk. Averaging over both $\mu$ and $y$, the $k$ componentwise risks

\[ E[R_i^* \mid A] = E[(\hat\mu_i - \mu_i)^2 \mid A] \]

are all the same. Thus, each is less than $V$ for all $A \ge 0$. This establishes Model-II componentwise minimaxity, that is, improvement on $y$, for all $A \ge 0$ and for every $i = 1, \ldots, k$.

Table 3
SHP estimates and posterior standard deviations of indices of success rates in the 10 NY hospitals, and two estimates of the associated risk

    y_i    mu_SHP,i     s_i    R*_i = s_i^2    R_i (unbiased)
  -2.15      -0.92     0.81        0.649            1.803
  -0.34      -0.15     0.66        0.435           -0.092
  -0.08      -0.03     0.66        0.430           -0.138
   0.01       0.00     0.66        0.429           -0.141
   0.08       0.03     0.66        0.430           -0.138
   0.57       0.24     0.67        0.445           -0.004
   0.61       0.26     0.67        0.447            0.015
   0.86       0.37     0.68        0.465            0.170
   1.11       0.48     0.70        0.488            0.377
   2.05       0.88     0.79        0.629            1.627

Not only is the SHP rule componentwise minimax under Model-II evaluations, but its (approximate) coverage intervals are shorter on average than those accompanying the unbiased estimate $y$ (since $E s_i < \sqrt{V}$ by Jensen's inequality). However, values of $y$ exist for which some $s_i^2 > V$, although this happens with small probability.

For the 10 hospitals we obtain $R^* = 4.85$ and $\hat R = 3.48$. Componentwise risks and other calculations are displayed in Table 3. Notice that some components have negative unbiased estimates of their mean-square-error, a not uncommon occurrence, and an undesirable feature of using this unbiased estimation approach for assessing component risks.

Unfortunately, real data rarely come with equal 
variances, designed experiments being the exception. 
Decision theorists have focused on this symmetric 
case because it is simple enough to enable exact 
(small sample) calculations. Decision theory has iden- 
tified the SHP and other priors close to it that lead 
to shrinkage estimators with good frequency prop- 
erties. Now the hope is that such priors are "trans- 
portable" to the unequal variances situation. 

It should be clear that Model-I verifications are rarely appropriate for scientific applications, even when equal variances obtain. Acceptance of Model-II, and thus of evaluations that average over Level-II distributions (given the hyperparameters, e.g., $A$), has many advantages for applications. It makes assessing weights for the loss function unimportant. Model-II allows estimators to exist that are minimax for every component for all $A$, not just when summed over all components. Confidence intervals exist that are on average shorter than standard intervals centered at $y_i$, and these also can have (nearly) uniformly higher coverages. The unequal variance setting gives further impetus to Model-II as a basis for evaluating the operating characteristics of shrinkage procedures, and also for constructing them from proper or improper priors that lead to good repeated sampling properties.

3. APPROACHES TO UNEQUAL VARIANCE DATA

In practice, equal variances are the exception rather than the rule. The variances for all 31 NY hospitals, not just the middle 10, differ by a factor of more than 20. Table 4 lists the data for these $k = 31$ NY hospitals and several shrinkage-related estimators, to be discussed further. The raw data contain the number of deaths $d_i$ within a month of CABG surgeries for each hospital $i$, sorted by increasing caseload $n_i$. The indices for success rates are calculated as

\[ y_i = C \times (\arcsin(1 - 2d_i/n_i) - \arcsin(1 - 2d/n)), \]

a variance stabilizing transformation of the unbiased death-rate estimates $\hat p_i = d_i/n_i$, assuming Binomial data, in which case the variance of the $y_i$ is approximately $V_i = \bar n/n_i$ (with $\bar n = \frac{1}{k}\sum_{j=1}^{k} n_j$). The factor $C$ is chosen so that the harmonic mean of the $V_i$, that is,

\[ V_H \equiv k \Big/ \sum_{i=1}^{k} V_i^{-1}, \]

is equal to 1. Larger values of $y_i$ correspond to higher success rates. The 10 hospitals used in the previous sections appear here as Hospitals 15-24, but in a different order.

Table 4
NY hospital profiling data and shrinkages

                        ---------- shrinkage factors ----------
   i      y     sd      HB                               SHP    sqrt(v)  mu_SHP  s_SHP     d      n
   1  -2.07   2.78   0.079   0.947   0.952   0.922   0.926    0.047   -0.15   0.76     3     67
   2  -0.22   2.76   0.081   0.946   0.952   0.921   0.925    0.047   -0.02   0.76     2     68
   3   0.58   1.57   0.249   0.850   0.864   0.790   0.808    0.103    0.11   0.69     5    210
   4  -1.87   1.42   0.305   0.823   0.839   0.754   0.777    0.115   -0.42   0.70    11    256
   5  -0.74   1.39   0.318   0.817   0.833   0.746   0.770    0.118   -0.17   0.67     9    269
   6  -1.97   1.37   0.327   0.812   0.829   0.741   0.766    0.119   -0.46   0.70    12    274
   7  -1.90   1.36   0.332   0.810   0.827   0.738   0.763    0.120   -0.45   0.70    12    278
   8   2.31   1.32   0.352   0.801   0.818   0.726   0.753    0.124    0.57   0.72     4    295
   9  -0.14   1.22   0.413   0.774   0.794   0.694   0.725    0.133   -0.04   0.64    10    347
  10  -1.21   1.22   0.413   0.774   0.794   0.694   0.725    0.133   -0.33   0.66    13    349
  11  -1.43   1.20   0.427   0.769   0.788   0.687   0.719    0.134   -0.40   0.66    14    358
  12   1.56   1.14   0.473   0.750   0.770   0.664   0.700    0.140    0.47   0.66     7    396
  13  -0.00   1.10   0.508   0.736   0.758   0.648   0.686    0.144   -0.00   0.62    12    431
  14   0.41   1.08   0.527   0.729   0.751   0.640   0.679    0.146    0.13   0.61    11    441
  15   0.08   1.04   0.568   0.714   0.736   0.622   0.664    0.149    0.03   0.60    13    477
  16  -2.15   1.03   0.579   0.710   0.733   0.618   0.660    0.150   -0.73   0.68    22    484
  17  -0.34   1.02   0.590   0.706   0.729   0.613   0.656    0.151   -0.12   0.60    15    494
  18   0.86   1.02   0.590   0.706   0.729   0.613   0.656    0.151    0.30   0.61    11    501
  19   0.01   1.01   0.602   0.702   0.725   0.608   0.652    0.152    0.00   0.60    14    505
  20   1.11   0.98   0.639   0.689   0.713   0.594   0.640    0.155    0.40   0.61    11    540
  21  -0.08   0.96   0.666   0.680   0.704   0.584   0.631    0.157   -0.03   0.58    16    563
  22   0.61   0.93   0.710   0.666   0.691   0.568   0.618    0.160    0.23   0.58    14    593
  23   2.05   0.93   0.710   0.666   0.691   0.568   0.618    0.160    0.78   0.66     9    602
  24   0.57   0.91   0.742   0.656   0.681   0.558   0.609    0.161    0.22   0.58    15    629
  25   1.10   0.90   0.758   0.651   0.677   0.552   0.604    0.162    0.44   0.59    13    636
  26  -2.42   0.84   0.870   0.619   0.646   0.518   0.575    0.167   -1.03   0.68    35    729
  27  -0.38   0.78   1.000   0.584   0.611   0.481   0.542    0.171   -0.17   0.53    26    849
  28   0.07   0.75   1.000   0.565   0.592   0.461   0.525    0.173    0.03   0.52    25    914
  29   0.96   0.74   1.000   0.558   0.586   0.455   0.519    0.174    0.46   0.54    20    940
  30  -0.21   0.66   1.000   0.501   0.529   0.399   0.469    0.177   -0.11   0.48    35   1193
  31   1.14   0.62   1.000   0.470   0.498   0.369   0.442    0.178    0.64   0.51    27   1340
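The indices and standard deviations in Table 4 can be approximately reproduced from the raw counts. The sketch below rests on two assumptions not stated in the text: that $d/n$ in the transformation is the pooled death rate $\sum d_i / \sum n_i$, and that $C = \sqrt{\bar n}$, which is what the delta method gives if the variance of $y_i$ is to be $\bar n / n_i$.

```python
import math

# Deaths d_i and caseloads n_i for the 31 hospitals, as listed in Table 4.
d = [3, 2, 5, 11, 9, 12, 12, 4, 10, 13, 14, 7, 12, 11, 13, 22,
     15, 11, 14, 11, 16, 14, 9, 15, 13, 35, 26, 25, 20, 35, 27]
n = [67, 68, 210, 256, 269, 274, 278, 295, 347, 349, 358, 396, 431, 441,
     477, 484, 494, 501, 505, 540, 563, 593, 602, 629, 636, 729, 849, 914,
     940, 1193, 1340]

k = len(n)
n_bar = sum(n) / k
C = math.sqrt(n_bar)            # assumption: delta-method scaling
p_pooled = sum(d) / sum(n)      # assumption: d/n is the pooled death rate

y = [C * (math.asin(1 - 2 * di / ni) - math.asin(1 - 2 * p_pooled))
     for di, ni in zip(d, n)]
V = [n_bar / ni for ni in n]
sd = [math.sqrt(Vi) for Vi in V]
V_H = k / sum(1 / Vi for Vi in V)   # harmonic mean of the variances

print(f"Hospital 1:  y = {y[0]:.2f}, sd = {sd[0]:.2f}")
print(f"Hospital 31: y = {y[-1]:.2f}, sd = {sd[-1]:.2f}")
print(f"harmonic mean of V_i = {V_H:.3f}")
```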

The $y_i$ cannot be nearly Normally distributed when $n_i$ is small, for example, Hospitals 1 and 2, but we act here as if the $y_i$ are Normal because that distribution is required for the estimators being considered. A more accurate model might approximate the data $d_i$ as Poisson, as Christiansen and Morris [7] do for medical profiling. For the remainder of this section we also focus on shrinkage to 0 ($r = 0$), the approximate average of the $y_i$.

3.1 Minimaxity in Model-I 

It may seem for unequal variances that the James-Stein estimator, which requires equal variances, can still be used. To do this, one would divide the values $y_i$ by their standard errors $sd_i = \sqrt{V_i}$ to create equal variances and apply James-Stein to $y_i/sd_i$. Then the shrinkage $\hat B_{JS} = (k-2)/S = 0.697$, where $S = \sum y_i^2/V_i = 41.59$, emerges for estimating $\mu_i/sd_i$. Transforming back to estimate $\mu_i$ yields a constant-shrinkage estimator

\[ \hat\mu_{JS,i} = (1 - \hat B_{JS})\,y_i. \]
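The James-Stein arithmetic can be retraced from the $y$ and $sd$ columns of Table 4. Since those columns are rounded to two decimals, this sketch reproduces $S$ and $\hat B_{JS}$ only approximately.

```python
# y_i and sd_i = sqrt(V_i) for the 31 hospitals, copied from Table 4 (rounded).
y = [-2.07, -0.22, 0.58, -1.87, -0.74, -1.97, -1.90, 2.31, -0.14, -1.21,
     -1.43, 1.56, -0.00, 0.41, 0.08, -2.15, -0.34, 0.86, 0.01, 1.11,
     -0.08, 0.61, 2.05, 0.57, 1.10, -2.42, -0.38, 0.07, 0.96, -0.21, 1.14]
sd = [2.78, 2.76, 1.57, 1.42, 1.39, 1.37, 1.36, 1.32, 1.22, 1.22,
      1.20, 1.14, 1.10, 1.08, 1.04, 1.03, 1.02, 1.02, 1.01, 0.98,
      0.96, 0.93, 0.93, 0.91, 0.90, 0.84, 0.78, 0.75, 0.74, 0.66, 0.62]

k = len(y)
# Standardize each y_i so that all components have unit variance.
z = [yi / si for yi, si in zip(y, sd)]
S = sum(zi ** 2 for zi in z)
B_js = (k - 2) / S

# The same constant shrinkage applies to every hospital on the original scale.
mu_js = [(1 - B_js) * yi for yi in y]

print(f"S = {S:.2f}, B_JS = {B_js:.3f}")
```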

This procedure is Model-I minimax if the loss function,

\[ L(\hat\mu, \mu) = \sum_{i=1}^{k} W_i(\hat\mu_i - \mu_i)^2, \]

has weights $W_i = 1/V_i = n_i/\bar n$. However, if the loss function has equal weights $W_i = 1$, then this estimator won't be minimax when the variances, equivalently the patient caseloads $n_i$, are substantially unequal, that is, it won't have uniformly lower mean squared error than $y$ for all $\mu$. Does any health leader exist with the insight to identify the proper weights $W_i$ and the authority to enforce their use?

For unequal variances, component shrinkages would be expected to depend on $i$. How should these shrinkages be estimated, and by what criteria should the estimates be guided? Data analysts desire more shrinkage for larger $V_i$ and less for smaller $V_i$, a pattern consistent with the law of large numbers, and with anticipated regression toward the mean, both of which suggest placing greater reliance on estimates $y_i$ that are based on more data and that have smaller variances. Paradoxically, Model-I minimaxity in the unequal variance setting requires reversed shrinkages (more shrinkage for smaller $V_i$), as shown next.

Using an integration by parts technique pioneered by Stein [25] and Berger [2], Hudson [12] and Berger [2] independently developed a simple Model-I minimax shrinkage estimator for the sum of (unweighted) squared errors, that is, having risk less than $\sum V_i$, the risk of the unbiased estimate $y$, for all $\mu$. Their estimator directly extends the James-Stein estimator to unequal variances by shrinking each $y_i$ toward 0 using the shrinkage factor

\[ \hat B_{HB,i} \equiv \frac{(k-2)/V_i}{\sum_{j=1}^{k} (y_j/V_j)^2}. \]

More generally, this estimator can be adapted easily to provide a minimax estimator for any set of weights $W_i$ in the loss function (by rescaling the $y_i$ to $W_i^{1/2} y_i$, obtaining the shrinkage factors above, and then transforming back to the original scale). In the special case $W_i = 1/V_i$, this rescaling will produce the James-Stein estimator with its equal shrinkages.

With equal weights $W_i = 1$, the risk of this minimax estimator has a simple unbiased estimate:

\[ \hat R_{HB} = \sum_{i=1}^{k} V_i\Bigl(1 - \frac{k-2}{k}\,\hat B_{HB,i}\Bigr) = \sum_{i=1}^{k} V_i - \frac{(k-2)^2}{\sum_{j=1}^{k} (y_j/V_j)^2}. \]

This is less than $\sum V_i$ for all values of $y$, because $\hat B_{HB,i} > 0$. It follows that the expectation of $\hat R_{HB}$ given $\mu$ is less than $\sum V_i$, thereby proving the Hudson-Berger estimator uniformly dominates $y$ and is minimax for an equally weighted loss function.

For the $k = 31$ hospitals the risk estimate of the Hudson-Berger rule is $\hat R_{HB} = 31.25$. This is 36.3% smaller than the risk of the unbiased estimate, $\sum V_i = 49.06$. Slightly more improvement stems from using shrinkages $\min(1, \hat B_{HB,i})$. Five hospitals, Hospitals 27-31, have $\hat B_{HB,i} > 1$, as shown in Table 4, and these shrinkages should be truncated at 1. However, these Model-I minimax shrinkage factors are smallest for the hospitals with the largest variances, even though the purpose of combining data in these applications is to borrow strength and thereby improve estimates for hospitals with less data.
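The Hudson-Berger quantities above follow from a few sums over Table 4. This sketch uses the rounded $y$ and $sd$ columns, so its results match the reported figures only approximately; the risk estimate is computed as $\sum V_i - (k-2)^2 / \sum (y_j/V_j)^2$, which is what Stein's unbiased risk identity gives for this shrinkage rule.

```python
# y_i and sd_i = sqrt(V_i) for the 31 hospitals, copied from Table 4 (rounded).
y = [-2.07, -0.22, 0.58, -1.87, -0.74, -1.97, -1.90, 2.31, -0.14, -1.21,
     -1.43, 1.56, -0.00, 0.41, 0.08, -2.15, -0.34, 0.86, 0.01, 1.11,
     -0.08, 0.61, 2.05, 0.57, 1.10, -2.42, -0.38, 0.07, 0.96, -0.21, 1.14]
sd = [2.78, 2.76, 1.57, 1.42, 1.39, 1.37, 1.36, 1.32, 1.22, 1.22,
      1.20, 1.14, 1.10, 1.08, 1.04, 1.03, 1.02, 1.02, 1.01, 0.98,
      0.96, 0.93, 0.93, 0.91, 0.90, 0.84, 0.78, 0.75, 0.74, 0.66, 0.62]

k = len(y)
V = [s * s for s in sd]
D = sum((yi / Vi) ** 2 for yi, Vi in zip(y, V))

# Hudson-Berger shrinkage factors: inversely proportional to V_i.
B_hb = [(k - 2) / (Vi * D) for Vi in V]
B_hb_truncated = [min(1.0, b) for b in B_hb]

sum_V = sum(V)
risk_hb = sum_V - (k - 2) ** 2 / D
# Same quantity written componentwise, as in the displayed formula.
risk_hb_alt = sum(Vi * (1 - (k - 2) / k * b) for Vi, b in zip(V, B_hb))

print(f"sum V_i = {sum_V:.2f}, HB risk estimate = {risk_hb:.2f}")
print(f"relative reduction = {100 * (1 - risk_hb / sum_V):.1f}%")
```

Note the reversed pattern: `B_hb` is smallest for Hospital 1 (largest variance) and exceeds 1 for the largest hospitals.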

Unfortunately, none of the 15 hospitals with the largest variances shrinks even as much as 2/3 of its standard error. By contrast, two of the six hospitals already with the most data and with the smallest variances (Hospitals 26-31) are shrunk by about two of their own (small) standard errors, a dramatic adjustment for them. This minimax estimator would thrill the management of Hospital 26, whose negative performance estimate $y_{26}$ (2.8 standard deviations below the mean) is shrunken upward by 2.5 standard deviations to make it nearly average. On the other hand, this minimax estimator shrinks Hospitals 27-31 all the way to 0 (the statewide average), so that Hospital 31 has its strong positive performance $y_{31} = 1.14$ (1.84 standard deviations above the mean) reduced by those 1.84 standard deviations so it also is estimated as average.

Fig. 4. Shrinkage factors against $V_i$ for various rules. The dots represent $V_i$ for the 31 NY hospitals.

3.2 Exchangeability in Model-II

The culprit here is the Model-I minimax criterion, and not the mathematically elegant procedure derived to achieve Model-I minimaxity. With substantially unequal variances and summed equally-weighted squared error losses, achieving Model-I minimaxity (nearly) requires reversed shrinkages, that is, smaller shrinkages for those components with larger $V_i$. ("Nearly" acknowledges that one could drastically diminish all the larger shrinkages to eliminate the reversal, but then with minuscule resulting shrinkages and no practical benefit.) Meanwhile, procedures that do not suffer from reversed shrinkages abound in practice, by relying instead on exchangeability assumptions in multilevel models and on Bayesian and empirical Bayesian considerations.

Figure 4 shows the Hudson-Berger Model-I minimax shrinkage factors, labeled as "HB," plotted against the variances $V_i$. Note their reversed shrinkages that decrease as variances increase. The James-Stein shrinkage factors are constant at $\hat B_{JS} = 0.697$, as shown by the horizontal line labeled "JS." Four other shrinkage rules will be introduced next, all motivated by Model-II considerations, so all with shrinkages that increase as variances increase.

Componentwise risks and interval coverages become more valuable when based on averages over both levels of Model-II. This requires accepting Level-II exchangeability for the random effects $\mu_i$ (or when $r > 0$, accepting exchangeability of the residuals $\mu_i - x_i'\beta$), given $A$. Shrinkages now may increase as the variances $V_i$ increase. Exchangeability of $\mu$ (or of its residuals) replaces assessing weights for component losses in applications. As in the equal variances case, procedures that dominate on all $k$ components become possible, as well as confidence intervals. With decision theoretic Model-II evaluations, componentwise dominance becomes the goal.

Most data analysts and modelers of real data are 
familiar with recognizing problems for which exchan- 
geability assumptions are reasonable, for example, 
they make such judgements routinely for error terms 
when fitting regressions. Exchangeability considera- 
tions would stop anyone from combining estimates 
of butterfly populations and percentages of sports 
car sales to augment the estimation of the 31 NY 
hospital success rates. Model-I standards provide no 
guidance on this, in favor of requiring assessment of 
relative weights Wi for butterfly vs. hospital data. 

With sufficiently disparate $V_i$, the minimax estimator of Hudson and Berger is not necessarily minimax for every component by Model-II evaluations. However, Model-II minimax shrinkage estimators do exist for any set of $V_i$. A recent such procedure by Brown, Nie and Xie [6] produces shrinkages that increase with $V_i$ and with componentwise squared errors smaller than $V_i$ for every $i$, for all $A \ge 0$, and for any variance pattern $V_1, \ldots, V_k$ for $k \ge 3$.

A popular Model-II shrinkage technique is based on the MLE of $A$. It provides relatively simple MLE estimates of the shrinkages $\hat B_{MLE,i} = V_i/(V_i + \hat A_{MLE})$ and of the unknown means $\hat\mu_{MLE,i} = (1 - \hat B_{MLE,i})\,y_i$. It is often used to construct confidence intervals for the $\mu_i$ by estimating the conditional variance

\[ \operatorname{var}(\mu_i \mid y, A) = (1 - B_i)V_i. \]

For $r = 0$, $\hat A_{MLE}$ maximizes the log-likelihood

\[ \log L(A) = \sum_{i=1}^{k} (-S_i B_i + \log(B_i))/2, \]

where $S_i \equiv y_i^2/V_i$ and $B_i \equiv V_i/(V_i + A)$. If $r > 0$ and Level-II in Table 2 specifies an unknown mean

\[ E[\mu_i \mid A] = x_i'\beta, \]

then restricted maximum likelihood (REML) should be used. This can be accomplished by analytically integrating out (not maximizing out) the $r$-dimensional $\beta$, assuming its prior density is flat in $r$ dimensions, as in [22]. In this case the likelihood $L(A)$ above would be replaced by the resulting integral over $\beta$, and then maximization would lead to $\hat A_{REML}$.

When $r > 0$, a larger value of $k$ is required for any possibility of minimaxity, at least $k \ge 3 + r$, with $k \ge 5 + r$ needed for minimaxity of the MLE in the equal variance case. The MLE shrinkages are graphed in Figure 4 for the 31 hospitals on the curve labeled "MLE."

A flaw of the MLE is that $\hat A_{MLE} = 0$ occurs commonly. This not only dictates full shrinkage, but also when $r = 0$ the conditional variance estimates $(1 - \hat B_{MLE,i})V_i$ are all equal to zero. In such cases using these for confidence intervals asserts that $\mu_i = 0$ with 100% confidence, a gross overstatement [22].
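For $r = 0$ the MLE of $A$ is a one-dimensional maximization, so a plain grid search suffices for illustration. This is a rough sketch using the rounded $y$ and $sd$ columns of Table 4, not the authors' fitting procedure; the grid range and spacing are arbitrary choices made here.

```python
import math

# y_i and sd_i = sqrt(V_i) for the 31 hospitals, copied from Table 4 (rounded).
y = [-2.07, -0.22, 0.58, -1.87, -0.74, -1.97, -1.90, 2.31, -0.14, -1.21,
     -1.43, 1.56, -0.00, 0.41, 0.08, -2.15, -0.34, 0.86, 0.01, 1.11,
     -0.08, 0.61, 2.05, 0.57, 1.10, -2.42, -0.38, 0.07, 0.96, -0.21, 1.14]
sd = [2.78, 2.76, 1.57, 1.42, 1.39, 1.37, 1.36, 1.32, 1.22, 1.22,
      1.20, 1.14, 1.10, 1.08, 1.04, 1.03, 1.02, 1.02, 1.01, 0.98,
      0.96, 0.93, 0.93, 0.91, 0.90, 0.84, 0.78, 0.75, 0.74, 0.66, 0.62]
V = [s * s for s in sd]

def log_lik(A):
    """Model-II log-likelihood of A for r = 0, up to an additive constant."""
    total = 0.0
    for yi, Vi in zip(y, V):
        B = Vi / (Vi + A)
        total += (-(yi ** 2 / Vi) * B + math.log(B)) / 2.0
    return total

# Grid search over A in [0, 3]; A = 0 is included because the MLE can sit there.
grid = [j * 0.001 for j in range(3001)]
A_mle = max(grid, key=log_lik)
B_mle = [Vi / (Vi + A_mle) for Vi in V]
cond_var = [(1 - b) * Vi for b, Vi in zip(B_mle, V)]

print(f"A_MLE = {A_mle:.3f}")
print(f"B_MLE: Hospital 1 = {B_mle[0]:.3f}, Hospital 31 = {B_mle[-1]:.3f}")
```

If the maximizer were `A_mle == 0.0`, every entry of `cond_var` would be exactly zero, which is the degenerate case criticized above.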

3.3 Construction at Level-III

Bayesian modeling extends Model-II to Model-III by constructing procedures from a single prior on the hyperparameters at Level-III. Bayes and formal Bayes procedures provide posterior means, variances and posterior distributions for the random effects $\mu_i$, given the data. As such Model-III Bayesian procedures are widely used in applications, the question is: what are their frequency properties? The posterior moments and distributions may not be computable exactly, but they are estimable for any particular data set and prior via MCMC and other simulation techniques. Moreover, the fundamental theorem of decision theory tells us that Model-III constructions (Bayes and formal Bayes) are required for Model-II admissibility.

From the decision-theoretic perspective much more is yet to be learned, even for models as simple as the Normal distributions of Table 2 in Levels I-II. It still isn't known, even with $r = 0$, whether (formal) priors exist that provide Model-II minimax estimators of $\mu$ no matter how varied the $V_i$. Beyond that, only a little has been done in the unequal variance case to determine if posterior probability intervals for formal priors, perhaps computed to offer posterior coverages of 95%, actually cover $\mu_i$ for every $i$, $A \ge 0$ at that nominal 95% level.

3.3.1 Stein's prior: Transported from the equal variance case. For the family of priors discussed in the equal variances case in Section 2, Stein's SHP stands out as the prime candidate for minimaxity and for confidence intervals in the unequal variances setting, assuming Model-II evaluations. Unfortunately, no general theorems about these properties have been proved for the SHP, formal mathematical proofs being hindered by the complexity of the posterior moments and intervals. However, particular investigations with the SHP have been encouraging.

Indeed, for any shrinkage estimator $\hat\mu_i = (1 - \hat B_i)\,y_i$ with $0 \le \hat B_i \le 1$, the difference between the component risks of $y_i$ and $\hat\mu_i$ conditioned on $A$ and $y$,

\[ r_i \equiv E[(y_i - \mu_i)^2 \mid A, y] - E[(\hat\mu_i - \mu_i)^2 \mid A, y] = B_i^2 y_i^2 - (\hat B_i - B_i)^2 y_i^2 = (2B_i - \hat B_i)\,\hat B_i\,y_i^2, \]

is positive for any value of $A \le V_i$ (since then $B_i \ge 1/2 \ge \hat B_i/2$), which, when integrating over $y$, shows that the Model-II risk of $\hat\mu_i$ is less than $V_i$ for any $A \le V_i$. Also, SHP will dominate the unbiased estimate $y_i$ when $A$ becomes large enough, since the componentwise Model-II risk converges to that of equal variances as $A$ tends to infinity.

The estimator $E[\mu_i \mid y]$ for any prior on $A$, for each $i$ and set of variances $V_i$, involves computing $E[B_i \mid y]$. For the SHP, with $L(A)$ being the Model-II likelihood of $A$, this is

\[ E[B_i \mid y] \equiv \hat B_{SHP,i} = \frac{\int_0^\infty V_i/(V_i + A)\,L(A)\,dA}{\int_0^\infty L(A)\,dA} \]

and the resulting estimate of $\mu_i$ is

\[ \hat\mu_{SHP,i} = (1 - \hat B_{SHP,i})\,y_i. \]

As with the equal variance case, the posterior variances $s_i^2 \equiv \operatorname{var}(\mu_i \mid y)$ for any prior are given by

\[ s_i^2 = V_i(1 - E[B_i \mid y]) + v_i\,y_i^2, \]

where $v_i$ is the posterior variance of $B_i$. For SHP this is

\[ v_i \equiv \operatorname{var}(B_i \mid y) = \frac{\int_0^\infty V_i^2/(V_i + A)^2\,L(A)\,dA}{\int_0^\infty L(A)\,dA} - \hat B_{SHP,i}^2. \]

Fig. 5. Stochastic estimate of SHP's Model-II componentwise relative risk improvement for 31 variances as in the hospitals, as a function of $A$. The dots represent the values of $A$ at which the simulations were performed (20,000 replicates of $y$ for each $A$).
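The SHP integrals above are one-dimensional and can be approximated on a grid. The following sketch makes several assumptions: it uses the rounded $y$ and $sd$ columns of Table 4, takes $L(A) \propto \prod_i B_i^{1/2} \exp(-S_i B_i/2)$ for $r = 0$, and truncates the integrals at $A = 12$, where the likelihood is negligible for these data. It should therefore be read as an approximate reproduction of the SHP columns, not as the authors' code.

```python
import math

# y_i and sd_i = sqrt(V_i) for the 31 hospitals, copied from Table 4 (rounded).
y = [-2.07, -0.22, 0.58, -1.87, -0.74, -1.97, -1.90, 2.31, -0.14, -1.21,
     -1.43, 1.56, -0.00, 0.41, 0.08, -2.15, -0.34, 0.86, 0.01, 1.11,
     -0.08, 0.61, 2.05, 0.57, 1.10, -2.42, -0.38, 0.07, 0.96, -0.21, 1.14]
sd = [2.78, 2.76, 1.57, 1.42, 1.39, 1.37, 1.36, 1.32, 1.22, 1.22,
      1.20, 1.14, 1.10, 1.08, 1.04, 1.03, 1.02, 1.02, 1.01, 0.98,
      0.96, 0.93, 0.93, 0.91, 0.90, 0.84, 0.78, 0.75, 0.74, 0.66, 0.62]
V = [s * s for s in sd]

def log_lik(A):
    """Model-II log-likelihood of A for r = 0, up to an additive constant."""
    return sum((math.log(Vi / (Vi + A)) - yi ** 2 / (Vi + A)) / 2.0
               for yi, Vi in zip(y, V))

# Midpoint grid on (0, 12); the flat prior on A makes the posterior
# proportional to L(A), so the weights are just normalized likelihood values.
h = 0.002
grid = [(j + 0.5) * h for j in range(6000)]
logs = [log_lik(A) for A in grid]
top = max(logs)
w = [math.exp(l - top) for l in logs]
norm = sum(w)

B_shp, v_shp = [], []
for Vi in V:
    m1 = sum(wj * Vi / (Vi + A) for wj, A in zip(w, grid)) / norm
    m2 = sum(wj * (Vi / (Vi + A)) ** 2 for wj, A in zip(w, grid)) / norm
    B_shp.append(m1)
    v_shp.append(m2 - m1 * m1)

mu_shp = [(1 - b) * yi for b, yi in zip(B_shp, y)]
s_shp = [math.sqrt(Vi * (1 - b) + vi * yi ** 2)
         for Vi, b, vi, yi in zip(V, B_shp, v_shp, y)]

for i in (0, 30):
    print(f"Hospital {i + 1}: B = {B_shp[i]:.3f}, sqrt(v) = {math.sqrt(v_shp[i]):.3f}, "
          f"mu = {mu_shp[i]:.2f}, s = {s_shp[i]:.2f}")
```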



The SHP shrinkage estimates $\hat B_{SHP,i}$ for the hospital data are plotted in Figure 4 on the curve labeled "SHP." The associated posterior standard deviations $\sqrt{v_i}$ are given by the dotted curve labeled "$\sqrt{v}$." Figure 5 displays a stochastic estimate of the relative Model-II risk improvement of SHP over the unbiased estimate $y_i$,

\[ \frac{V_i - E[(\hat\mu_{SHP,i} - \mu_i)^2 \mid A]}{V_i} = \frac{E[r_i \mid A]}{V_i}, \]

for $k = 31$ and for the variance pattern of the 31 hospitals $V_1, \ldots, V_{31}$ as a function of $A$.

This was done by simulating 20,000 replicates of $y$ at 15 different values of $A$, and averaging the 20,000 values of $r_i/V_i$ at each $A$. Different curves plot the risk improvement for different components $i$. All the curves are positive and strictly decreasing. The curves are ordered according to their $V_i$ values, the largest ($V_1$) providing the top curve. Thus, for this variance pattern, at least and seemingly generally, the greatest shrinkage benefit accrues to the components with the greatest uncertainty.

The graph's monotonicity suggests that the min- 
imum Model-II risk improvement for each compo- 
nent occurs as A approaches infinity. That corre- 
sponds to the limiting equal variance case. Interest- 
ingly, despite their stochastic nature, the curves do 
not cross each other. These results, although only 
for one data set, give hope for establishing compo- 
nentwise Model-II risk dominance for all A of the 
SHP shrinkage procedure over the unbiased esti- 
mate y. 



For equal variances, Figure 3 showed that $\hat\mu_{SHP,i} \pm 1.96\,s_{SHP,i}$ produces minimum coverage of $\mu_i$ very close to 95%. Figure 6 investigates the corresponding coverage properties for the unequal variances in the pattern of the 31 NY hospitals. For each $y$ and $A$ of the previous simulation, the coverage probability

\[ P(\mu_i \in \{\hat\mu_{SHP,i} \pm 1.96\,s_{SHP,i}\} \mid y, A) \]

of $\mu_i$ by the "SHP Normal" interval given $y$ and $A$ is analytically computed from

\[ \mu_i \mid y, A \overset{ind}{\sim} N[(1 - B_i)\,y_i,\; V_i(1 - B_i)], \]

then averaged over the 20,000 values of $y$ for each $A$. Thus, Figure 6 displays the coverage probabilities for each Hospital $i$ using Model-II of Table 2 as a function of the harmonic mean $B_H$ of the shrinkage factors,

\[ B_H \equiv \frac{k}{\sum_{i=1}^{k} B_i^{-1}} = \frac{V_H}{V_H + A} = \frac{1}{1 + A}, \]

a monotone decreasing function of $A$ (recall that the 31 CABG indices have been scaled to have $V_H = 1$).
All but two of the 31 curves exhibit a pattern similar to that of
equal variances in Figure 3 when k = 20: exactly 95% coverage for
$B_H$ close to 0, a minimum with 0.5% under-coverage near $B_H = 0.6$,
and over-coverage for $B_H$ close to 1. The curves are nonintersecting
and increasing with $V_i$ for the 4 highest values of $B_H$, but cross
each other repeatedly for $B_H < 0.5$, presumably because of simulation
inaccuracy. The two nearly superimposed highest curves, which never
(or barely) overcover $\mu_i$, correspond to the two hospitals with the
highest variances, Hospitals 1-2, these variances being nearly 8 times
the size of the 31 variances' harmonic mean. In all cases the coverage
probabilities are never below 94.5%.

Fig. 6. Simulation of SHP's 95% Normal interval Model-II componentwise
coverage probabilities for each of the 31 hospital variances as a
function of $B_H$. The dots represent the values of $B_H$ at which the
simulations were performed.
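The conditional coverage step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the model is the r = 0 case in which $\mu_i \mid y, A \sim N[(1 - B_i)y_i, V_i(1 - B_i)]$, and the interval center and half-width are passed in as inputs (in the paper they would come from the SHP fit, which is not computed here). The numerical values are made up for the example.

```python
import math


def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def conditional_coverage(center, half_width, y_i, V_i, A):
    """P(mu_i in center +/- half_width | y, A) for the r = 0 model,
    where mu_i | y, A ~ N[(1 - B_i) y_i, V_i (1 - B_i)]."""
    if A <= 0:
        raise ValueError("A must be positive for a non-degenerate posterior")
    B_i = V_i / (V_i + A)                    # shrinkage factor
    post_mean = (1.0 - B_i) * y_i            # posterior mean given A
    post_sd = math.sqrt(V_i * (1.0 - B_i))   # posterior sd given A
    upper = (center + half_width - post_mean) / post_sd
    lower = (center - half_width - post_mean) / post_sd
    return normal_cdf(upper) - normal_cdf(lower)


# Hypothetical values, chosen only for illustration.
y_i, V_i, A = 1.2, 2.0, 0.5
B_i = V_i / (V_i + A)
half_width = 1.96 * math.sqrt(V_i * (1.0 - B_i))

# An interval built from the true conditional posterior covers about 95%.
exact = conditional_coverage((1.0 - B_i) * y_i, half_width, y_i, V_i, A)

# An interval with a slightly misplaced center covers a little less.
shifted = conditional_coverage(0.9 * (1.0 - B_i) * y_i, half_width, y_i, V_i, A)
```

Averaging such conditional coverages over simulated y for a fixed A would give one point on a curve like those in Figure 6.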

Figure 7 compares SHP and unshrunken estimates, and their standard
deviations, for the data with the 31 NY hospitals. The absolute values
of the rules, $|\hat\mu_{\mathrm{SHP},i}|$ (circle) and $|y_i|$ (+/-),
are plotted above the x-axis, and the negative standard deviations,
$-s_{\mathrm{SHP},i}$ and $-sd_i$, are plotted below. "Plus" signs
indicate that the estimates were positive, for example, Hospitals 3
and 8, whereas "minus" signs indicate that the estimates were negative,
for example, Hospitals 1-2. It appears for these data that all the
SHP coverage intervals will be shorter than those of the unbiased
estimate, although this need not always hold for all data sets y, as
discussed earlier for the equal variances in Section 2.

3.3.2 Posterior mean versus posterior mode: The ADM technique.
Deriving the SHP rule for unequal variances requires numerical
computation of k + 1 integrals (including the common denominator in
$\hat B_{\mathrm{SHP},i}$). ADM (adjustment for density maximization,
Morris [20]) is used here for shrinkage estimation to provide a
relatively simple approximation to the SHP, as in Morris and Tang [22].
To explain the ADM, the MLE provides a simple shrinkage formula from
the mode of the likelihood L(A) that is equivalent to the posterior
mode of A for the SHP. However, the mode of a right-skewed distribution
like that of A underestimates the mean. Furthermore, the mean
$E[B_i \mid y]$ is needed, not the mode. The ADM provides a better
approximation than the MLE for shrinkage factors while still requiring
only two derivatives to approximate the posterior distributions of
$B_i \mid y$. The ADM can be used with various priors in Figure 2, but
here we apply it to approximate the posterior distribution of each
shrinkage $B_i \mid y$ when the SHP is the chosen prior distribution
for A.

Fig. 7. SHP (black circles). Absolute values of unshrunken and SHP
estimates with signs indicated by (+/-), top half. Standard deviations
(bottom half) for SHP are always closer to 0 than $\sqrt{V_i}$. "Plus"
signs indicate positive estimates $y_i > 0$.

For shrinkage estimation, ADM approximates the distribution of each
$B_i = V_i/(V_i + A)$ by a Beta distribution. Because shrinkage
coefficients lie in [0, 1], and these coefficients linearly determine
the Level-II distributions (Table 2), two-parameter Beta distributions
are the natural choice for shrinkage approximations, and not the Normal
distribution (the distribution for which MLE and the posterior mean
would coincide). When the prior on A is taken to be the SHP, and with
Beta distribution approximations to $B_i = V_i/(V_i + A)$, the ADM
"adjustment" simply amounts to maximizing $A \cdot L(A)$, rather than
$L(A)$, for each i and $V_i$. Note that the maximum always occurs with
A > 0. Calling the maximizing value $\hat A_{\mathrm{ADM}}$,
$E[B_i \mid y]$ is then approximated by
$\hat B_{\mathrm{ADM},i} = V_i/(V_i + \hat A_{\mathrm{ADM}})$. This ADM
approach has been used before for shrinkage estimation, for example, by
Christiansen and Morris [7], Li and Lahiri [15] and Morris and
Tang [22].

For the 31 hospitals, $\hat A_{\mathrm{ADM}} = 0.657$, so
$E[B_i \mid y]$ is approximated by $\hat B_i = V_i/(V_i + 0.657)$. The
variances of $B_i$ could be obtained from the second derivative of
$\log(A \cdot L(A))$ at the adjusted mode, $\hat A_{\mathrm{ADM}} = 0.657$.
The ADM shrinkages are graphed in Figure 4 on the curve labeled "ADM."
They are more conservative than those of the MLE, and indeed follow
the SHP curve closely for all but the smallest variances $V_i$.
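The contrast between maximizing $L(A)$ and maximizing $A \cdot L(A)$ can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes the r = 0 case, so that marginally $y_i \mid A \sim N(0, V_i + A)$ with known $V_i$, it uses a plain grid search in place of a proper optimizer, and the data and the function names `log_lik` and `fit_shrinkage` are hypothetical.

```python
import math


def log_lik(A, y, V):
    """Marginal log-likelihood of A up to a constant, assuming
    y_i | A ~ N(0, V_i + A) independently (the r = 0 case)."""
    return -0.5 * sum(math.log(v + A) + yi * yi / (v + A)
                      for yi, v in zip(y, V))


def fit_shrinkage(y, V, adjust, n_grid=2000, A_max=100.0):
    """Grid-search maximizer of L(A) (adjust=False, the MLE) or of
    A * L(A) (adjust=True, the ADM adjustment under the SHP)."""
    # Candidate values: A = 0 plus a log-spaced grid up to A_max.
    grid = [0.0] + [A_max * 10.0 ** (-6.0 + 6.0 * j / (n_grid - 1))
                    for j in range(n_grid)]

    def objective(A):
        if not adjust:
            return log_lik(A, y, V)
        # log(A * L(A)); A = 0 is excluded because log(0) = -infinity.
        return math.log(A) + log_lik(A, y, V) if A > 0 else -math.inf

    A_hat = max(grid, key=objective)
    B_hat = [v / (v + A_hat) for v in V]   # B_i = V_i / (V_i + A_hat)
    return A_hat, B_hat


# Hypothetical data, for illustration only (not the hospital data).
y = [2.1, -1.4, 0.3, 1.8, -0.6, 2.9, -2.2, 0.9]
V = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 4.0]

A_mle, B_mle = fit_shrinkage(y, V, adjust=False)
A_adm, B_adm = fit_shrinkage(y, V, adjust=True)
```

Because $\log A$ is increasing, the adjusted maximizer cannot fall below the MLE on the same grid, so the ADM shrinkage factors are never larger than the MLE ones. That is the sense in which the text calls them more conservative.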

As was seen before in Figure 3, standard errors and interval estimates
with the SHP coverages as approximated by ADM are never perceptibly
below 95%, for equal variances and k = 4, 10, 20. The ADM is readily
applicable to approximate posterior point and interval estimates for
other priors on A in the unequal variance case. Further, Model-II
evaluations of ADM include investigations by Morris-Tang [22] for
Normal distributions, Everson-Morris [11] for multivariate Normal data,
and Christiansen-Morris [7] for Poisson data. Evidence therein with
special cases and/or with special data sets has been quite encouraging,
with no negative experiences thus far.

3.4 Potential of the Multilevel Model: 
A Useful Rule of Thumb 

For equal variances, good shrinkage rules such as 
James-Stein or SHP are simple enough to calculate 
that they can be implemented immediately in prac- 
tice. For unequal variances the calculations are much 
more involved and easily accessed software may be 
unavailable or need to be mastered. Researchers justifiably may ask
how much they stand to gain by fitting a hierarchical model before
actually fitting it, their alternatives being to use unbiased estimates
$\hat\mu_i = y_i$ or the fully shrunken estimates, here
$\hat\mu_i = 0$ (for r = 0), or when r > 0 to shrink all the way to
a grand mean or to a linear regression estimate.

A helpful feature of using MLE or ADM methods to fit shrinkages,
perhaps with a model like that of Table 2, is that a simple point
estimate $\hat A$ of A suffices to estimate all shrinkage factors
$B_i$, and consequently also all means $\mu_i$. Moreover, an estimate
$\hat A$ of A leads to a simple estimate $\hat B_H$ of the harmonic
mean of the shrinkage factors $B_H$ through the identity

(4)    $\hat B_H = V_H/(V_H + \hat A).$

Analogously to its equal variance counterpart B, the harmonic mean
shrinkage $0 < B_H < 1$ provides a useful summary for gauging the
benefits of fitting a shrinkage model. Values of $B_H$ close to 0
suggest that there will be relatively little shrinkage overall, in
which case a researcher might be justified to use the unbiased
estimates $y_i$. Or, values of $B_H$ close to 1 might justify using
the fully shrunken regression estimates $x_i'b$,

$$b = (X'V^{-1}X)^{-1}X'V^{-1}y,$$

where $X' = [x_1, \ldots, x_k]$ and $V = \mathrm{diag}(V_1, \ldots, V_k)$.
Values of $B_H$ near 1/2 give the strongest case for estimating
shrinkages.
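Identity (4) is easy to check numerically. The sketch below uses a hypothetical set of variances; the value of A is the ADM estimate quoted in the text, used here only as a convenient number. It compares the harmonic mean of the individual factors $B_i = V_i/(V_i + A)$ with $V_H/(V_H + A)$.

```python
def harmonic_mean(values):
    """Harmonic mean k / sum(1 / v_i) of positive numbers."""
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical variances; A = 0.657 is borrowed from the text only as
# an illustrative value, not paired with the real hospital variances.
V = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 4.0]
A = 0.657

V_H = harmonic_mean(V)
B = [v / (v + A) for v in V]          # individual shrinkage factors

B_H_direct = harmonic_mean(B)         # harmonic mean of the B_i
B_H_identity = V_H / (V_H + A)        # right-hand side of identity (4)
```

The two quantities agree up to floating-point error, since $\sum_i 1/B_i = k + A \sum_i 1/V_i$.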

Letting $S = \sum_{i=1}^{k} (y_i - x_i'b)^2/V_i$, when $r \ge 0$ and
the variances are equal, $B_H = B$ and we have

$$E[S \mid A] = (k - r)/B \quad\text{and}\quad E[(k - r - 2)/S \mid A] = B,$$

which leads to the James-Stein estimator. When the variances are
unequal, it is easily seen for r = 0 that

$$E[S \mid A] = \sum_{i=1}^{k} (V_i + A)/V_i = k/B_H.$$

Taken together these facts suggest a simple point estimate for $B_H$,

$$\hat B_H = \frac{k - r - 2}{S} = \frac{k - r - 2}{k - r} \times \frac{1}{\hat\sigma^2},$$

where

$$\hat\sigma^2 = \frac{1}{k - r} \sum_{i=1}^{k} \frac{(y_i - x_i'b)^2}{V_i}$$

is the mean square error from a (weighted linear) regression output.
Note that one can easily rearrange (4) to solve for

$$\hat A = V_H(1 - \hat B_H)/\hat B_H.$$

This estimate, in turn, can be used to provide simple estimates of
each individual shrinkage factor $B_i$ by $V_i/(V_i + \hat A)$. Even
if $\hat B_H$ is small, having this rough estimate of every $B_i$ is
useful in case there are a few $B_i$ that are appreciably bigger
than 0.

These estimates of $B_i$ are plotted as the fourth and final Model-II
rule in Figure 4, labeled "F," giving a curve that is almost identical
to the MLE shrinkages. Data analysts can use this easy "rule-of-thumb"
that can be based on regression outputs for anticipating individual
and overall shrinkages, without computing more precise shrinkage
estimates. For the 31 hospitals S = 41.59 and $\hat B_H = 0.697$,
suggesting that a good Model-II rule would outperform both the
individual estimates $y_i$ and the fully shrunken estimates, alike.
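The rule of thumb amounts to a few arithmetic steps, sketched here in Python. The function names are hypothetical. The hospital figures (k = 31, S = 41.59, $V_H = 1$, r = 0) are those quoted in the text, while the individual variances passed to `shrinkage_factors` are made up, since the hospital variances are not listed in this section. No truncation is applied, so the sketch presumes $S > k - r - 2$.

```python
def weighted_sum_of_squares(y, V):
    """S = sum(y_i^2 / V_i), the r = 0 case (shrinkage toward 0)."""
    return sum(yi * yi / v for yi, v in zip(y, V))


def rule_of_thumb(S, k, V_H, r=0):
    """Return (B_H_hat, A_hat) from the regression-style summary S."""
    B_H_hat = (k - r - 2) / S                  # estimate of harmonic-mean shrinkage
    A_hat = V_H * (1.0 - B_H_hat) / B_H_hat    # identity (4) solved for A
    return B_H_hat, A_hat


def shrinkage_factors(V, A_hat):
    """Rough individual shrinkages B_i = V_i / (V_i + A_hat)."""
    return [v / (v + A_hat) for v in V]


# Values quoted in the text for the 31 hospitals (V_H scaled to 1, r = 0).
B_H_hat, A_hat = rule_of_thumb(S=41.59, k=31, V_H=1.0, r=0)

# Hypothetical individual variances, to show how each B_i would follow.
B_rough = shrinkage_factors([0.3, 1.0, 8.0], A_hat)
```

With the quoted inputs this gives $\hat B_H = 29/41.59$, which rounds to the 0.697 reported in the text, and the larger-variance components receive the larger shrinkage factors.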

4. SUMMARY AND CONCLUSIONS 

We have reviewed a special and relatively simple class of hierarchical
models, models for Normal distributions that have received significant
attention from a nonasymptotic (in k) decision-theoretic perspective.
Early equal-variance Model-I shrinkage estimators, evaluated by an
(unweighted) sum of squared errors criterion, were found that provided
Model-I minimaxity and even admissibility. That opened exciting new
vistas. However, the great preponderance of applications (even when
Normal distributions apply) arise with unequal variances, and there
Model-II evaluations are seen to be much more appropriate. Model-II
evaluations are both less and more general than Model-I, less because
they average over the Level-II parameters, and more general by not
requiring judgements about appropriate weights for component losses,
and also by empowering interval estimation. A Level-II exchangeability
assumption, for example, as in Table 2, enables componentwise Model-II
dominance to be possible.

Many more investigations are needed in the Model-II setting for small
and moderate numbers k of random effects
$\mu = (\mu_1, \ldots, \mu_k)'$. Does Stein's harmonic prior (SHP)
transport to the unequal variance case, for example, by offering
Model-II componentwise minimaxity, conditionally on all
hyperparameters, especially on all A > 0? Our experience suggests that
this is entirely possible for both the equal and the unequal variances
settings, but there are no formal proofs yet. Does the full posterior
distribution, geared to offer 95% posterior probability of coverage
for fixed data with the SHP prior, provide intervals that cover at
least 95% of the time? Showing this with Model-II would require at
least 95% coverage for every fixed value of A > 0 that holds for every
component (e.g., for every hospital), after averaging over both levels
of Model-II. If intervals cover less than 95% of the cases, how close
does the minimum coverage come to 95%? How well and when do relatively
simple methods for estimating shrinkages work, like MLE and ADM
methods? What Level-III priors lead to Model-II dominance by providing
componentwise minimaxity and confidence intervals that are shorter for
every component? Do SHP intervals cover every $\mu_i$ more often for
every i, A than do the standard (unshrunken) confidence intervals used
by data analysts?

These theoretical questions about operating char- 
acteristics under Model-II evaluations can be asked 
for other yet more complicated models, especially 
for other distributions at Level-I and at Level-II. 
Shrinkage estimators arise when fitting generalized 
linear multilevel models to data that follow expo- 
nential families at Level-I, if conjugate distributions 
are used for the Level-II random effects. That is, 
just as Normal conjugate distributions are used at 
Level-II in Table 2, Gamma distributions are conju- 
gate when Level-I specifies Poisson likelihoods, and 
Betas are conjugate for Binomial likelihoods. The 
advantage of conjugate distributions at Level-II is 
that shrinkage factors arise in conditional means, 
given the observations. Crucially, conjugate distri- 
butions are relatively robust, having the virtue of 
being "G2 minimax" among all possible Level-II dis- 
tributions (priors) in the sense of Jackson et al. 
[13, 18]. This helps make shrinkage estimators simple and robust.
Shrinkage factors also provide useful summaries, and so can serve a
purpose like that of $R^2$ in OLS regressions.

We have argued that Model-II and its exchangeability assumptions are
more appropriate than Model-I for developing and evaluating shrinkage
estimators. This holds especially for applications in which
improvements would be expected to hold for every $\mu_i$. Hospital
directors might agree to having their own hospital's performance be
estimated by combining information from other hospitals, but not
unless each was assured that doing so would make their own hospital's
estimate more accurate.

This paper argues especially that evaluations of shrinkage methods for
unequal variance data have received too little attention, relative to
the large literature on the Normal equal variances case. It is time to
change that.